ABSTRACT


Industry 5.0, the latest industrial revolution, advances the Smart Factory concept
by emphasizing human-machine collaboration and sustainability. It incorporates
the human aspect into industrial processes, promoting critical thinking, personalization,
and adaptability, while leveraging technologies such as IoT and AI for increased
efficiency and productivity. However, this era also introduces a complex landscape
of cyber threats. As machines, systems, and humans become interconnected, ensuring
cybersecurity in smart factories becomes crucial to balance innovation and efficiency
with robust security and privacy-preservation measures. In response to these
challenges, this doctoral research contributes solutions that address the
security and privacy vulnerabilities inherent in the Industry 5.0 scenario.

The first contribution of this doctoral research is a federated learning (FL)
methodology for malware detection based on network analysis. It introduces a
cost-effective and efficient approach to deep-learning-based malware detection.
The methodology addresses computational overhead and privacy concerns by leveraging
network traffic data, balancing emerging technologies with security and privacy to
mitigate large-scale malware attacks that could undermine Industry 5.0’s core principles.

The second contribution puts forth a novel privacy-preserving secure framework
called PPSS, which integrates blockchain with an energy-efficient Proof-of-Federated Deep
Learning (PoFDL) consensus protocol to optimize the FL process in terms of preserving
data privacy, enhancing system reliability, and promoting transparency. PPSS
tackles the challenges associated with cyber threat detection and data privacy,
specifically within the context of resource-constrained and heterogeneous industrial
systems.




The third contribution focuses on developing an efficient, robust, federated cyber
threat detection framework for Industrial IoT. The approach leverages federated
learning and generative adversarial networks (GANs) to enhance IDS efficiency,
privacy protection, and resilience against adversarial attacks. A federated generative
model is employed for data augmentation to limit the attack surface, thereby
improving cyber threat detection reliability in the face of zero-day and adversarial
threats.

The performance of the proposed approaches was evaluated using a new cybersecurity
dataset named Edge-IIoTset, specifically designed for cyber threat detection in
Industrial IoT. The results showcase the efficiency and reliability of cyber
threat detection under various data distribution modes.

Combining the insights from these contributions, this thesis proposes a compre- 
hensive approach to safeguard Industry 5.0 from cybersecurity threats. Federated 
deep learning techniques optimize the process of knowledge sharing among par- 
ticipants while protecting data privacy in a resource-efficient manner. Integrating 
blockchain-enabled intrusion detection systems ensures the integrity and security 
of data exchanged among IoT-based devices. Deploying generative adversarial net- 
works fortifies the system’s resilience against zero-day and adversarial attacks. 


Keywords: Cybersecurity, Industrial Internet of Things, Blockchain, Federated Learning,
Privacy-Preserving, Intrusion Detection System, Cyber Threat Detection
Summary:


Industry 5.0, the latest industrial revolution, emphasizes the concept of smart factories,
with a focus on collaboration between humans and machines for sustainability. This
integration fosters critical thinking, personalization, and adaptability, while taking advantage of
technologies such as IoT and AI for efficiency and productivity. However, this era also introduces
a complex landscape of cyber threats. As machines, systems, and humans become connected,
ensuring cybersecurity in smart factories is crucial to balance innovation and efficiency with
robust security measures and privacy preservation. This doctoral research proposes novel
solutions to address the security and privacy problems in the Industry 5.0 landscape.


This doctoral research focuses on three main contributions. The first concerns
federated learning (FL) for malware detection based on network analysis. The second is a
new secure framework called PPSS, which integrates blockchain with the Proof-of-Federated
Deep Learning (PoFDL) protocol to optimize FL processes while preserving data
confidentiality, improving system reliability, and promoting transparency. PPSS addresses
the challenges of cyber threat detection and data confidentiality, particularly in
heterogeneous and resource-constrained industrial systems.


The third contribution focuses on developing an efficient and robust federated cyber threat
detection framework, specifically designed for the Industrial Internet of Things (IIoT).
The approach exploits federated learning and generative adversarial networks (GANs) to
improve the efficiency of intrusion detection systems (IDS), privacy protection, and
resilience against adversarial attacks. A federated generative model was used for data
augmentation to limit the attack surface, thereby improving the reliability of cyber threat
detection in the face of zero-day and adversarial threats.


The performance of these approaches was evaluated using Edge-IIoTset, a new dataset
designed specifically for IoT threat detection. The results demonstrate the efficiency and
reliability of these approaches under various data distribution scenarios. Combining these
insights, this research proposes a comprehensive approach to protect Industry 5.0 against
cybersecurity threats.
CHAPTER 1




INTRODUCTION 


“Security is a process, not a product. Products provide some protection, but
the only way to effectively do business in an insecure world is to put processes
in place that recognize the inherent insecurity in the products.”


— Bruce Schneier 


Industry 5.0, the latest phase of the industrial revolution, represents a significant
shift in the manufacturing landscape that emphasizes human-machine collaboration
and sustainability. It builds upon the Smart Factory concept, which introduced
technologies such as Cloud computing, the Internet of Things (IoT), Artificial Intelligence
(AI), and Big Data analytics, and complements these advances by facilitating
human intervention when necessary. It leverages critical thinking, personalization,
and adaptability to enhance efficiency and productivity.

However, Industry 5.0 also brings about a complex landscape of security challenges.
The interconnected nature of machines, systems, and humans, combined with
extensive data exchange, opens the door to a range of cyber threats and privacy
intrusions [1]. In response to these challenges, researchers are developing new
cybersecurity strategies to protect privacy and secure industrial networks and control
systems from large-scale cyber threats.


Federated learning (FL) has recently emerged as a decentralized and privacy-preserving
computing paradigm, offering a viable solution to mitigate security and
privacy risks in IIoT environments. FL lets edge devices train machine-learning (ML)
and deep-learning (DL) detection models locally and share only model updates for
global optimization, so raw sensitive data is never transmitted [2]. This approach
keeps sensitive information private and fosters a secure environment for data processing.
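
As a rough illustration of the training loop described above, the following minimal sketch assumes a logistic-regression detector and sample-count-weighted averaging in the style of FedAvg. These choices, the function names, the synthetic data, and the hyperparameters are all hypothetical and are not taken from the thesis; the point is only that each client returns model weights, never its raw samples.

```python
import math
import random


def local_update(weights, samples, lr=0.1, epochs=5):
    """Train a logistic-regression detector on one client's private samples."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad)]
    return w  # only the weights leave the device


def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its local sample count."""
    total = float(sum(client_sizes))
    dim = len(client_weights[0])
    return [
        sum(w[j] * n / total for w, n in zip(client_weights, client_sizes))
        for j in range(dim)
    ]


def run_round(global_w, clients):
    """One communication round: local training, then server-side aggregation."""
    updates = [local_update(global_w, data) for data in clients]
    sizes = [len(data) for data in clients]
    return federated_average(updates, sizes)


def make_client(n, rng):
    """Synthetic stand-in for one site's traffic features (hypothetical data)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(4)]
        y = 1.0 if x[0] + 0.5 * x[1] > 0 else 0.0
        data.append((x, y))
    return data


rng = random.Random(0)
clients = [make_client(n, rng) for n in (40, 60, 100)]
global_w = [0.0] * 4
for _ in range(20):
    global_w = run_round(global_w, clients)
```

Weighting by sample count keeps a small site from pulling the global model as strongly as a large one; other aggregation rules are possible and may suit non-IID data better.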

Cross-silo federated learning, an extension of the FL paradigm, further advances
the capabilities of IIoT systems by enabling different industrial organizations to
collaboratively exchange intrusion events, incident logs, and reported alert data about
cyber attacks. The participating entities share knowledge and insights through transfer
learning without compromising data privacy, bolstering the collective defense
against cyber threats.

Despite the promising advantages of FL and cross-silo FL, the security landscape
remains dynamic and challenging. Adversarial attacks, including data poisoning and
inference attacks, have demonstrated the potential to exploit vulnerabilities in IIoT
systems, posing significant threats to the integrity and reliability of model updates.
Additionally, trace information within model updates may inadvertently disclose private
and sensitive data, necessitating robust audit gateways and enhanced security
measures to thwart potential leaks.

This doctoral thesis examines the security and privacy challenges that impede the
widespread adoption of IIoT technologies. By investigating the potential of federated
learning and cross-silo federated learning, the research seeks to develop novel and
practical strategies to enhance the security, privacy, and reliability of IIoT systems.
Through empirical evaluations and rigorous experimentation, this research aims to
contribute to the advancement of secure and privacy-preserving IIoT frameworks,
fostering a resilient and trustworthy smart industry ecosystem.


In the subsequent chapters, we explore the theoretical foundations, methodology,
and implementation details of our proposed approaches. We then present comprehensive
analyses of experimental results, leading to insights and practical recommendations
for industry stakeholders, cybersecurity professionals, and researchers. By fortifying
the foundations of IIoT with robust security measures, we strive to accelerate the
realization of the full potential of the smart industry, heralding a new era of
optimized productivity, reliability, and service quality.


1.1 Research Questions and Objectives


Our research concentrates on implementing an efficient and effective security
monitoring mechanism, the Intrusion Detection System (IDS), to safeguard Industry 5.0
against emerging cyber threats. To achieve this objective, we comprehensively
analyze the vulnerabilities and architectural characteristics of Industrial Internet of
Things (IIoT) networks.

This analysis will serve as the foundation for developing IDS solutions that are reliable,
robust, and tailored to the constraints inherent in Industrial IoT environments.
In particular, Table 1.1 outlines the specific research questions identified to address
those goals. These research questions will guide our investigation and contribute to
developing advanced IDS solutions for Industry 5.0, strengthening its cybersecurity
posture and enhancing its resilience against evolving cyber threats.


1.2 Research Methodology 


We adopted the Systematic Literature Review (SLR) approach to identify literature
relevant to our research interests. The primary objective is to investigate
cybersecurity solutions for IDS implementation in Industrial IoT. The research
methodology involved identifying, selecting, and evaluating proposed and related
studies. Specific research questions were formulated, and Scopus academic search
queries were conducted in the "Title," "Keywords," and "Abstract" fields of relevant
publications. The search results were confined to publications from 2015 to 2021. The
study selection process was refined by focusing on practical studies that align with
the research questions described in Table 1.1. By leveraging the SLR methodology, we
aim to contribute to the understanding and advancement of cybersecurity measures
within the Industrial Internet of Things.


| Research Questions | Objectives |
|---|---|
| RQ1: What are the vulnerabilities and architectural characteristics of Industrial IoT networks in Industry 5.0? | To explore potential weaknesses and design flaws that may pose security risks in IIoT infrastructures deployed in smart industry settings. Understanding these vulnerabilities and architectural characteristics is crucial for developing effective security measures and intrusion detection strategies. |
| RQ2: What system architectures and technology types do IDSs use to secure IIoT networks and their various components? | To investigate and analyze the different system architectures and technology types employed by IDSs specifically tailored for securing IIoT networks and their constituent components. Subsequently, we aim to provide insights into these systems’ key design principles and implementation strategies. |
| RQ3: Which IDS detection methodologies are used for IIoT? | To identify and assess the various IDS detection methodologies specifically designed and applied for IIoT environments. By assessing these methodologies’ efficacy, strengths, and limitations, we aim to gain a thorough understanding of their capabilities in detecting and mitigating cyber threats in IIoT networks. |
| RQ4: How are emerging technologies consolidated for effective and secure detection? | To explore the integration and consolidation of emerging technologies, such as ML and DL, Cloud/fog services, Big data analytics, and Edge intelligence, to devise efficient and secure detection mechanisms tailored to IIoT systems. We aim to enhance privacy preservation and the overall resilience and reliability of IIoT security. |
| RQ5: Which performance metrics and experimental datasets are used to evaluate IDSs? | To investigate and evaluate the performance metrics commonly employed to assess the effectiveness of IDSs in IIoT settings. We aim to explore the experimental datasets utilized for comprehensive testing and validation of IDS functionalities in realistic IIoT scenarios. |

TABLE 1.1: Research Questions and Objectives.
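
To make the Scopus search described in this section concrete, the sketch below assembles an advanced-search string restricted to the title, abstract, and keyword fields and to the 2015 to 2021 window. `TITLE-ABS-KEY` and `PUBYEAR` are Scopus advanced-search field codes; the search terms themselves are hypothetical placeholders, since the exact terms used in the review are not listed here.

```python
def build_scopus_query(term_groups, first_year, last_year):
    """Build a Scopus advanced-search string.

    Terms inside a group are alternatives (OR); groups are combined with AND.
    """
    clauses = []
    for group in term_groups:
        alternatives = " OR ".join(f'"{term}"' for term in group)
        clauses.append(f"TITLE-ABS-KEY({alternatives})")
    # PUBYEAR comparisons are strict, so widen the bounds by one year.
    year_filter = f"PUBYEAR > {first_year - 1} AND PUBYEAR < {last_year + 1}"
    return " AND ".join(clauses + [year_filter])


# Hypothetical search terms, for illustration only.
term_groups = [
    ["intrusion detection", "IDS"],
    ["industrial internet of things", "IIoT"],
]
query = build_scopus_query(term_groups, 2015, 2021)
print(query)
```

Keeping the terms in grouped lists makes it easy to record each query variant alongside the number of hits it returned, which helps when documenting the selection process.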
Furthermore, we formulate our research strategy using the PICO framework, which
ensures a structured and systematic approach to the complex aspects of IDS
implementation in IIoT. We identify the PICO research question as follows:


• Population (P): Our research focuses on IDS-based cyber threat detection in
Industrial IoT.

• Intervention (I): We consider all the proposed works of IDS within the domain
of IIoT.

• Comparison (C): Our investigation compares various IDS methods based on criteria
and factors outlined in related studies and proposed solutions.

• Outcomes (O): The primary objectives of our study are to establish requirements,
address challenges, and propose evaluation mechanisms for IDS-based
solutions to enhance the security of IIoT. These findings will serve as valuable
contributions to further research in this area.


This framework guides our investigation and yields insights into developing and
improving IDS solutions for securing IIoT environments.


1.3 Main Contributions 


The main contributions of this thesis are summarised as follows: 


1. A systematic review of IDS-based cyber threat detection for Industrial IoT was
conducted, encompassing a comprehensive examination of deployment strategies,
detection approaches, methodologies, and data sources employed for evaluation.
Furthermore, a critical analysis of the selected literature reveals future directions
and challenges that must be carefully navigated when designing robust
IDS solutions to enhance the security of IoT-enabled critical infrastructure within
industrial sectors. This contribution was presented at the 2021 International
Conference on Theoretical and Applicative Aspects of Computer Science
(ICTAACS 2021) [1].


2. A federated learning methodology for Android malware detection based on network
analysis was proposed. This contribution introduced the federated learning
(FL) paradigm as a cost-effective approach to deep-learning-based malware detection
using network traffic data. The aim is to overcome the computational overhead
and privacy concerns of conventional malware detection strategies while maintaining
the efficiency of detecting large-scale malware attacks. The performance
of our methodology was evaluated using the benchmark dataset AAGM-2017
across various FL settings, and its outcomes were compared against those of
centralized training methods. The results demonstrate the efficiency and effectiveness
of Android malware detection in terms of detection accuracy and
computation cost, while providing data privacy without any significant adverse
effects on classification performance compared to conventional centralized
approaches. This contribution was published as a chapter in Springer’s Cyber
Malware [3].


3. A two-stage intrusion detection framework for IoT security was proposed. This
contribution introduced a dual-detector approach. In the initial stage, the first
detector uses an adversarial training strategy as a robust optimization approach
against emergent adversarial threats. Subsequently, a DL model is employed as
the second detector, focused on intrusion identification. The framework was
evaluated on the recently published Edge-IIoTset dataset in terms of
detection accuracy and resilience against adversarial attacks. The experimental
results underscore the proposed methodology’s effectiveness in detecting intrusions
and persistent adversarial examples. This contribution was presented at
the 2023 International Conference [4].


4. A Privacy-Preserving Secure Framework (PPSS) using Blockchain-enabled Federated
Deep Learning for Industrial IoT. The framework introduces a blockchain-based
scheme designed to enhance the security of cross-organization Federated
Learning (FL), ensuring the process remains secure while minimizing adverse
effects on learning performance. A novel lightweight and energy-efficient proof
of learning, PoFDL, is proposed for effective model validation and storage.
Additionally, integrating differential privacy training enhances the privacy protection
of model updates. The performance evaluation of the PPSS framework is
conducted using the recently published Edge-IIoTset dataset, employing convolutional
neural networks (CNNs) as deep networks across various FL settings.
The experimental results demonstrate the framework’s efficiency and effectiveness,
notably in handling heterogeneous datasets and non-IID data distributions.
Moreover, the framework’s robustness against common blockchain
attacks, including Byzantine attacks, Sybil attacks, and honest-but-curious attacks,
is thoroughly assessed to ensure security and reliability. This contribution
was published in Elsevier’s Pervasive and Mobile Computing [5].


5. A distributed learning paradigm leveraging FL and generative adversarial
networks (GANs) was proposed. The aim is to improve privacy protection,
facilitate effective training, and enable robust detection of large-scale cyber
threats and emergent adversarial attacks. This contribution introduces a
three-model framework incorporating Wasserstein-Conditional-GANs for data
augmentation and a DL classifier for cyber threat classification. First, a distributed
deep generative model is trained on highly imbalanced and non-IID
distributed data under the FL paradigm. This model generates qualified and diverse
synthetic data. Subsequently, the augmented data undergoes validation
using our proposed data curation method before being employed to train a federated
learning classifier. This process enhances resilience and enables efficient
detection of novel cyber threats not initially present in the training data.

The performance evaluation of this framework is conducted using the recently
published Edge-IIoTset dataset. The evaluations encompass detection efficiency
against recent state-of-the-art adversarial attacks and zero-day cyber threats.
Furthermore, we assess the effectiveness of incorporating differential privacy
training as an additional technique for improved privacy preservation and its
impact on model performance. The experimental results demonstrate the validity
and diversity (multi-class) of the augmented data generated using the distributed
generative model. Additionally, the results highlight enhanced cost-effectiveness
when utilizing the proposed data augmentation approach in contrast
to implementing DP training, particularly regarding privacy preservation.
This contribution was published in Elsevier’s Internet of Things [6].
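
Contributions 4 and 5 mention differential privacy training of model updates without detailing the mechanism at this point. One common pattern, assumed for this sketch rather than taken from the thesis, is to clip each update to a fixed L2 norm and then add Gaussian noise scaled to that norm. The function names and parameter values below are hypothetical.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale the update down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def privatize_update(update, clip_norm, noise_multiplier, rng):
    """Clip the update, then add Gaussian noise with std noise_multiplier * clip_norm."""
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]


rng = random.Random(42)
raw_update = [3.0, 4.0]  # L2 norm is 5.0
clipped = clip_update(raw_update, clip_norm=1.0)  # rescaled to norm 1.0
noisy = privatize_update(raw_update, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single participant can move the global model, and the noise multiplier trades privacy against accuracy, which is the kind of cost that the evaluation above weighs against data augmentation.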


1.4 List of Publications 


Journal papers 


• Hamouda, D., Ferrag, M. A., Benhamida, N., & Seridi, H. (2022). PPSS: A
privacy-preserving secure framework using blockchain-enabled federated deep
learning for Industrial IoT. Pervasive and Mobile Computing, 88, 101738.
https://doi.org/10.1016/j.pmcj.2022.101738

• Hamouda, D., Ferrag, M. A., Nadjette, B., Hamid, S., & Ghanem, M. C. (2024).
Revolutionizing intrusion detection in industrial IoT with distributed learning
and deep generative techniques. Internet of Things, 1-15.
https://doi.org/10.1016/j.iot.2024.101149


Conference papers 


• Hamouda, D., Ferrag, M. A., Benhamida, N., & Seridi, H. (2021, November).
Android malware detection based on network analysis and deep convolutional
neural network. The 4th International Hybrid Conference on Informatics and
Applied Mathematics (IAM’21).

• Hamouda, D., Ferrag, M. A., Benhamida, N., & Seridi, H. (2021, December).
Intrusion detection systems for industrial internet of things: a survey. In 2021
International Conference on Theoretical and Applicative Aspects of Computer
Science (ICTAACS) (pp. 1-8). IEEE. https://doi.org/10.1109/ICTAACS53298.2021.9715177

• Hamouda, D., Ferrag, M. A., Benhamida, N., & Seridi, H. (2022, November).
Network-based intrusion detection using generative adversarial networks.
The 5th International Hybrid Conference on Informatics and Applied Mathematics
(IAM’22).

• M. A. Ferrag, D. Hamouda, M. Debbah, L. Maglaras and A. Lakas, "Generative
Adversarial Networks-Driven Cyber Threat Intelligence Detection Framework
for Securing Internet of Things," 2023 19th International Conference on Distributed
Computing in Smart Systems and the Internet of Things (DCOSS-IoT),
Pafos, Cyprus, 2023, pp. 196-200, https://doi.org/10.1109/DCOSS-IoT58021.2023.00042.


Book Chapter 


• Hamouda, D., Ferrag, M. A., Benhamida, N., Kouahla, Z. E., Seridi, H. (2024).
Android Malware Detection Based on Network Analysis and Federated Learning.
In: Almomani, I., Maglaras, L. A., Ferrag, M. A., Ayres, N. (eds) Cyber
Malware. Security Informatics and Law Enforcement. Springer, Cham.
https://doi.org/10.1007/978-3-031-34969-0_2
Co-authored papers


• Ferrag, M. A., Friha, O., Hamouda, D., Maglaras, L., & Janicke, H. (2022).
Edge-IIoTset: A new comprehensive, realistic cyber security dataset of IoT and IIoT
applications for centralized and federated learning. IEEE Access, 10, 40281-40306.
https://doi.org/10.1109/ACCESS.2022.3165809

• Ferrag, M. A., Shu, L., Djallel, H., & Choo, K. K. R. (2021). Deep learning-based
intrusion detection for distributed denial of service attack in agriculture
4.0. Electronics, 10(11), 1257. https://doi.org/10.3390/electronics10111257

• Ferrag, M. A., Friha, O., Kantarci, B., Tihanyi, N., Cordeiro, L., Debbah, M.,
Hamouda, D., ... & Choo, K. K. R. (2023). Edge Learning for 6G-enabled Internet
of Things: A Comprehensive Survey of Vulnerabilities, Datasets, and Defenses.
In IEEE Communications Surveys & Tutorials.
https://doi.org/doi:10.1109/COMST.2023.3317242


1.5 Thesis Organisation 


The remaining parts of this thesis are organized as follows. Chapter 2 explores
IDS-oriented security solutions for IoT-enabled critical industrial infrastructure.
It examines the architecture of this infrastructure, its threat models, and its security
requirements. The chapter reviews adaptive IDS implementations, deployment strategies,
machine learning techniques, and blockchain technologies for privacy-preserving and
secure IDS. The study emphasizes the need for efficient and sophisticated
privacy-preserving IDS systems that perform comparably to centralized approaches while
addressing challenges and security requirements.

In Chapter 3, a federated learning (FL) paradigm combined with network behavior
analysis is proposed for malware detection. The focus is on preserving
privacy, minimizing computation costs, and enhancing detection efficiency. The
chapter explores malware detection using network layer features. It presents an
efficient detection methodology using FL with a CNN and compares it with
conventional centralized methods, highlighting advantages regarding computation
cost and privacy protection.

In Chapter 4, a privacy-preserving secure framework named PPSS is proposed.
This chapter explores the development and experimental aspects of the PPSS
framework. Topics covered include component interaction, blockchain-enabled
federated learning, secure communication, key management, proof of federated deep
learning, and blockchain security analysis. The chapter also discusses how PPSS
enables cyber threat detection, considering various scenarios and experimental
settings, including data distribution, global model accuracy, convergence time,
differential privacy training, energy costs, and blockchain performance.

In Chapter 5, an improved federated generative framework named
FedGen-ID is proposed. This framework addresses imbalanced and private data
challenges by employing distributed data augmentation techniques. It aims to enhance
efficiency and robustness against cyber threats. The chapter discusses using
data augmentation methods to support a synthetically enhanced federated learning
scheme, leading to improved detection efficiency and resilience against zero-day
attacks. Three models are discussed: one refines local Critics to strengthen resilience,
the second focuses on improving cybersecurity, and the third serves as a cyber threat
classifier.

Finally, Chapter 6 summarizes the key findings of this research and presents
recommendations for future work.
CHAPTER 2




CONCEPTS AND LITERATURE REVIEW 


“The art of war teaches us to rely not on the likelihood of the enemy's not
coming, but on our own readiness to receive him; not on the chance of his not
attacking, but rather on the fact that we have made our position unassailable”


— Sun Tzu, The art of war 


Introduction 


The above quote resonates with the essence of our research endeavor, emphasizing
the paramount importance of preparedness and resilience in the face of potential
threats. In the realm of cybersecurity, this principle becomes ever more relevant as
we navigate the dynamic landscape of Industry 5.0 and the integration of technologies
such as the Internet of Things (IoT), cloud/fog computing, artificial intelligence
(AI), and collaborative robotics to boost productivity and business. The industrial
landscape has evolved toward heightened interconnectivity and increased complexity.
However, this evolution has also heightened its vulnerability to cyber
intrusions, mainly due to the inherent security challenges embedded in the development
of sophisticated technologies and their increased connectivity and exposure to
public networks. Furthermore, the lack of globally adopted technical standards
for IIoT security and interoperability increases the exposure of this ecosystem to
cybersecurity risks [7]. More than ever, cybersecurity breaches pose significant threats
to enterprises of all sizes and sectors. Cybercrime has escalated dramatically,
making businesses much more likely to suffer financial and reputational damage
from cyber attacks, with cybercrime-related damage projected to reach
$10 trillion annually by 2025 [8].

FIGURE 2.1: The Industrial IoT network model: Layers, Threats, and
Defense Strategies.

Cybersecurity plays a crucial role in safeguarding Industry 5.0, in smart
factory technologies and beyond, by ensuring the confidentiality, integrity, and
protection of information shared among interconnected components. To this end,
security professionals and researchers recommend using and developing a proficient
Intrusion Detection System (IDS) solution. This system serves as a robust security
monitoring mechanism to identify ongoing potential security threats and safeguard
industrial networks and control systems [1]. However, the prevalence of information
and operational technologies in Industry 4.0 and 5.0 has changed the nature of
cyber threats and how we deal with them, as it also requires addressing challenges
related to reliability, complexity, security, and data privacy [9].

In the following section, we explore the gap between the development of traditional
IDSs and the design of adequate IDS schemes for the unique challenges of
Industrial IoT ecosystems. Our study addresses this gap, examines the most
pertinent and effective detection strategies, and sheds light on the
challenges and requirements of securing Industry 5.0 from emerging cyber threats.
We intend to provide a comprehensive foundation for developing robust and
future-proof intrusion detection solutions, fostering the continued growth and security of
Industry 5.0.
2.1 Industrial IoT Architecture, Threat Models, and Security Requirements


In this section, we aim to explore potential weaknesses and design flaws that may
pose security risks in Industrial IoT infrastructures deployed in the smart industry.
Understanding these vulnerabilities and architectural characteristics is crucial for
developing effective security measures and detection strategies.

The architecture of Industrial IoT differs slightly from that of conventional IoT and
Cyber-Physical Systems (CPS), adding critical control systems and further security
challenges. A typical IIoT architecture can be described as hierarchical layers of
various networking technologies and communication protocols that establish
interconnections between hardware devices, control software, and end-users.
Figure 2.1 illustrates the network model, including layers, threat models, and
defensive mechanisms within the context of industrial IoT.


• Physical layer: This layer collects data about the physical environment or acts
on it, using sensors, actuators, and meters. Devices in this layer are usually
resource-constrained. Communication protocols and technologies at this level
are designed to operate with limited bandwidth, constrained CPU and memory
capacity, and low energy consumption [10].

• Middleware layer: This layer manages field devices to facilitate the integration
and communication between physical objects and supervisory control
systems. It includes devices such as Programmable Logic Controllers
(PLCs), Remote Terminal Units (RTUs), and Intelligent Electronic Devices (IEDs). It
has limited computation resources and heterogeneous communication
infrastructures, including wired and wireless connections that interconnect
objects with control systems.




• Control layer: This layer manages automation and intelligent control of the
industrial infrastructure. It enables real-time processing of data collected from the
substation control systems of the previous layer. It includes control systems such as
Supervisory Control and Data Acquisition (SCADA), Distributed Control System (DCS),
and Human-Machine Interface (HMI), and other applications such as data historians,
Manufacturing Execution Systems (MES), and Enterprise Resource Planning (ERP).

• Demilitarized zone (DMZ): Contains critical devices that must be exposed to the
outer network, such as application servers and web servers. At this stage, internet
connectivity and standard IT protocols interconnect operational technology (OT) with
IT and with users across longer distances.

• Application layer: Includes processing and management tools that require costly
computation and storage resources. Applications at this layer are mainly based on
cloud services and process the collected data to obtain valuable insights about the
physical environment. Using AI approaches, these applications may make decisions
based on this information to control physical objects.


This architectural composition of IIoT, with its diverse networking technologies and
communication protocols, presents significant challenges for IDS and exacerbates the
complexity of detection. Incorporating distinctive insecure-by-design protocols and
various communication infrastructures, such as wireless networks, complicates cyber
threat detection [11]. Furthermore, improving the security level while simultaneously
ensuring the availability of IIoT systems is challenging because of resource
restrictions [12].


2.1.1 Industrial IoT Threat Models 


Figure 2.1 and Table 2.1 highlight the various ways in which IIoT networks can be
compromised, emphasizing the need for robust security measures and continuous
monitoring to mitigate these risks.

Threat Model | Description
Data Poisoning | This involves injecting malicious or inaccurate data into the IoT system, leading to incorrect decisions or actions by connected devices [13].
Sybil Attacks | In this type of attack, an attacker creates multiple fake identities to overwhelm the system or gain unauthorized access to IoT devices [14].
Data Privacy Intrusion | IoT devices often collect and transmit sensitive data. Attackers may attempt to intercept or access this data, violating users' privacy [13].
Malware Injections | Attackers can inject malware into IoT devices, compromising their functionality and potentially using them for malicious purposes [3].
Byzantine Attacks | These attacks involve compromised or malicious nodes within a network that intentionally provide conflicting information, leading to system failures or incorrect decision-making [14].
Protocol Exploitation | IoT devices communicate using various protocols. Attackers can exploit vulnerabilities in these protocols to gain unauthorized access or manipulate device behavior [10].
AI Associated Threats | As AI is integrated into IoT devices, attackers could target vulnerabilities in AI algorithms to manipulate or disrupt device behavior and decision-making [15].
External Threats | External attackers can target IoT devices by exploiting vulnerabilities in the devices' software, firmware, or communication channels.

TABLE 2.1: Threat Models Against IoT Network Architecture.


In light of these issues, it becomes evident that a comprehensive strategy comprising
improved security monitoring, proactive threat detection, and resource-efficient
defensive mechanisms is important for adaptable IDS solutions in IIoT environments.

2.2 Adaptive Intrusion Detection System (IDS) for Securing Industrial IoT


Cyber threat detection plays a pivotal role in meeting the substantial security
requirements of the complex technological landscape of IIoT environments. An IDS is a
crucial component in this endeavor: it actively monitors system operations and
network activities, investigates data patterns, and identifies anomalous behaviors
that could signify malicious or unauthorized activities [16]. Recently, machine
learning (ML) and deep learning (DL) have emerged as an advancement within the field
of IDS, providing the means to identify novel and continually evolving forms of cyber
attacks [17].

However, given the heterogeneous and distributed nature of data sources, together
with the limited storage, energy, and computational capabilities of end-point IIoT
devices, resource efficiency must be considered carefully when designing and
implementing IDS security mechanisms [18]. Several studies have addressed the
deployment of IDS across diverse components of the IoT, including communication
protocols [19] and Industrial Control Systems (ICS) [20]. Furthermore, the
application of IDS has expanded to sectors such as transportation [21], critical
infrastructures such as gas pipelines [22], and smart grids [23].

To clarify the distinctions between traditional IDS-based security systems
implemented for information systems and the envisaged IDS systems tailored for
Industrial IoT deployments, we present a comprehensive IDS taxonomy in Figure 2.2.

FIGURE 2.2: Taxonomy of IDS Solutions for Industrial IoT.


2.2.1 Taxonomy of IDS Deployment Strategies in IIoT Environments


Implementing an IDS requires considering several aspects, including system
architecture, deployment strategy, monitoring methodologies, and detection strategies
[24]. Within Industry 5.0, the convergence of IoT solutions with both cloud and fog
paradigms highlights the potential deployment strategies for IDS architectures.
Moreover, the success of ML and DL techniques has significantly enhanced the
capabilities of IDS systems. Driven by these advancements, researchers have developed
proficient IDS models tailored for the IoT domain. Figure 2.2 presents a taxonomy of
IDS implementation for Industrial IoT.


• IDS Deployment Strategy: The deployment strategy largely depends on locations and
data flow dynamics within the IIoT framework. Depending on whether data aggregation
is centralized or distributed, the IDS can be placed on the same node that collects
the data or distributed across multiple nodes to cover a broader span of the network.
This deployment facilitates the monitoring and analysis of IIoT components, thereby
fortifying plant networks against diverse cyber threats [1].

- Centralized/Decentralized IDS: Aydogan et al. [19] studied the effectiveness of a
centralized IDS in promptly detecting attacks, reporting that, owing to its
comprehensive data aggregation, it outperforms individual IDS agent nodes. By
leveraging cloud and fog computing, a centralized approach can overcome the
constraints posed by the limited resources of the IIoT. In this setting, both data
aggregation and resource-intensive processing are hosted by cloud services [25].
However, data privacy concerns must be addressed in this environment. Ioannou et al.
[26] demonstrated that a decentralized IDS offers the distinct advantage of operating
as a fault-tolerant system. Moreover, deploying multiple detection models within the
heterogeneous landscape of the IIoT holds the promise of effectively identifying
large-scale attacks while mitigating data privacy concerns [26, 27].


- Distributed IDS: The IDS is integrated across multiple agent nodes that
collectively contribute to global decision-making based on gathered data. By
leveraging edge computing, a distributed IDS offers benefits such as reduced observed
data volume and energy-efficient task execution. For instance, Zhang et al. [28]
propose a multi-layer data-driven IDS approach that expands attack detection
coverage. Khan et al. [22] introduce a multi-level anomaly detection strategy with
distinct detection methods at each level. Shu et al. [21] demonstrate the efficacy of
distributed IDS using both Independent and Identically Distributed (IID) and non-IID
data sources.


• IDS Detection Strategy: The chosen detection strategy is vital for IDS performance
and robustness. It involves selecting the overall detection methodology
(anomaly-based, signature-based, or hybrid) and effectively including Machine
Learning (ML) and Deep Learning (DL) approaches to ensure efficient, real-time
detection. Moreover, the evolving landscape of IDS training paradigms introduces
federated learning, which enables collaborative and decentralized model refinement
over distributed IoT networks.


- Detection Methodology: This includes several distinct approaches. Signature-based
detection stores known attack signatures to quickly identify and validate future
attacks, offering real-time and cost-effective detection. However, it falls short
against unknown and polymorphic attacks, and it requires ongoing updates, human
intervention, and secure connections. Anomaly-based detection relies on User Behavior
Analysis (UBA) to create a dynamic detection model based on software, hardware, or
human interactions. While efficient and self-adaptive for identifying unfamiliar
attacks, it tends to produce more false alarms and requires more computational
resources [1].

Specification-based detection establishes models of legitimate behavior through
protocol or system analysis and detects deviations from these specifications without
a training phase [29]. Although effective in spotting attacks, it fails against
attacks that conform to the specification model. Combining these methodologies can
yield higher accuracy, fewer false alarms, and real-time detection. For instance,
Otoum et al. [30] proposed a hybrid IDS framework using IoT gateways that integrates
signature-based and anomaly-based approaches for enhanced effectiveness. Feng and
Chana [31] presented a comparable method, augmenting their IDS with a baseline
signature database for time-series anomaly detection, thereby bolstering system
efficiency.


- ML and DL-based detection: ML techniques have proven effective in safeguarding IIoT
networks and their physical entities [22, 25, 28]. These algorithms are particularly
well-suited to IIoT because of its inherently task-oriented nature and the
consistency of its data distributions, which enhance both traffic predictability and
the efficiency of intrusion detection [1]. DL, a subset of ML, comprises
sophisticated ensembles of operations that learn multi-layered representations [17].
The distinctive potential of DL-based IDS becomes apparent when enormous quantities
of training data are available. Various DL approaches are equipped to handle a
diverse spectrum of intrusions and cyber-attacks with varying degrees of complexity
and distribution [32].


Although ML-based and DL-based detection approaches have shown success in enhancing
the security of IIoT through IDS, certain limitations require careful consideration.
Both ML and DL models are sensitive to slight changes in data, which can
significantly decrease detection and attack classification performance. The lack of
interpretability and the limited transparency of decision-making processes make it
difficult to understand the origins of attacks and to conduct further forensic
analyses. Additionally, the computational demands of data processing and learning
can exceed the capacities of available IIoT resources [1].

Equally important are adversarial attacks that aim to undermine model learning so
that malicious activities evade detection. This emerging challenge is significant for
both ML-based and DL-based IDS [4]. Addressing these complexities is crucial for
establishing strong and dependable security measures within IIoT environments.


- Federated Learning-based IDS (FL-IDS): In light of the challenges above, FL is
proposed as a promising training approach for ML and DL-based IDS for IIoT [27]. Its
distributed and privacy-preserving approach aligns well with the characteristics of
IIoT environments, offering a pathway to improved detection accuracy, data privacy,
and resource efficiency [33]. This paradigm is discussed further in the following
sections.

• IDS Data Source: Input data are essential for detecting large-scale attacks
effectively. Several dimensions describe their sources, characteristics, and analytic
potential.


- Data source: There are two main categories, network-based data and host-based data.
Industrial IoT also uses physical process information, an approach known as
state-based IDS. A hybrid approach is often employed to achieve comprehensive and
timely detection results [34]. For example, Zhang et al. [28] proposed a hybrid IDS
to robustly detect intrusions that may not be detectable by monitoring network and
host system data alone, such as command tampering and false data injection attacks by
an insider in ICSs. Zhou et al. [34] proposed multiple data models to represent the
general knowledge of Industrial Process Control Systems (PCS) to facilitate the
implementation of hybrid anomaly-based IDS.


- Data Granularity: Refers to the level of detail at which data is collected,
processed, or stored in an information system. It could be raw packet-level data,
aggregated flow-level data, or any equivalent level of data aggregation [35].
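
The detection methodologies and data granularity levels in the list above can be
illustrated with a minimal Python sketch. It is not taken from any of the cited
works: the packet fields, the signature set (`KNOWN_BAD_PORTS`), the function names,
and the z-score threshold are all assumptions chosen for illustration. Packets are
first aggregated into flow-level records; each flow is then checked against a
signature list and, if no signature matches, against a simple statistical baseline
learned from benign traffic.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical packet record: (src, dst, dst_port, size_bytes).
def aggregate_flows(packets):
    """Aggregate packet-level data into flow-level records keyed by (src, dst, port)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, port, size in packets:
        flow = flows[(src, dst, port)]
        flow["packets"] += 1
        flow["bytes"] += size
    return dict(flows)

# Assumed signature database: destination ports treated as known-bad.
KNOWN_BAD_PORTS = {23, 2323}

def fit_baseline(normal_flows):
    """Learn mean and standard deviation of bytes per flow from benign traffic."""
    sizes = [flow["bytes"] for flow in normal_flows.values()]
    return mean(sizes), pstdev(sizes)

def classify(key, flow, baseline, z_threshold=3.0):
    """Hybrid decision: signature match first, then a z-score anomaly check."""
    _, _, port = key
    if port in KNOWN_BAD_PORTS:
        return "signature"
    mu, sigma = baseline
    if sigma > 0 and abs(flow["bytes"] - mu) / sigma > z_threshold:
        return "anomaly"
    return "benign"
```

The signature check is cheap and precise for known patterns, while the z-score check
stands in for the more general anomaly models discussed above; a real IIoT IDS would
use richer features and a learned model rather than a single statistic.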


2.3 Privacy Preserving Intrusion Detection in Industrial IoT Network


An IDS aims to safeguard against security breaches by analyzing monitored data and
detecting potential cyber threats. However, its implementation introduces a challenge
to users' privacy, leading to the need for sophisticated IDS mechanisms that
prioritize privacy preservation. This development has roots in earlier research, such
as Park et al. [36], who employed cryptographic techniques to enhance the security of
log files. More recently, the deployment of ML and DL in various industries has
raised major privacy concerns, among which the handling of sensitive proprietary data
is prominent. This concern extends to IDS development for IIoT security and has
stimulated demand for innovative conceptual frameworks that accommodate both privacy
preservation and security.


2.3.1 Federated Learning-based IDS 


Federated Learning (FL) introduces an innovative way to collaboratively learn and 
distribute computations for applications based on ML and DL. Instead of sending 
client data to a central server, FL sends models from the server to specific clients. 
These clients then train the models using their local data and conventional ML meth- 
ods. This approach ensures privacy and security by keeping data on the client’s side 
[37]. 

In safeguarding Industrial IoT, an FL-based IDS emerges as a dependable security
strategy. It addresses the security needs and challenges of IIoT by enabling
decentralized decision-making for IDS across diverse IIoT setups. Figure 2.3 depicts
an overview of the Industrial IoT network model and the organizational structure of
an FL-based IDS system. Algorithm 1 outlines the core process of the FL-based IDS for
securing IIoT environments. This algorithm enables collaborative learning and
distributed computations across client devices while maintaining data privacy. The
workflow starts with the Server component initializing the model and orchestrating
the FL process over several rounds. In each round, the server randomly selects a
subset of clients from the total client pool. These clients then update their local
models in parallel. The server aggregates these client model updates to refine the
global model.

On the client side, represented by the Client (i.e., device) component, each client
operates independently. It splits its local dataset into batches and performs local
training epochs on these batches. Only the resulting local model updates are
communicated back to the server for aggregation, so the raw data stays on the client
and data privacy is preserved.

FIGURE 2.3: The Industrial IoT Network Model with an Organizational Chart of an
FL-Based IDS System.


In a recent study [21], Shu et al. developed a collaborative IDS for Vehicular Ad hoc
Networks (VANETs) within a distributed SDN environment, utilizing multiple SDN
controllers to train a single IDS model. The unique aspect is that the controllers
achieve this without directly sharing their sub-network data flows. Another approach,
proposed by Fan et al. [38], is an FL-based IDS framework that combines cloud and
edge computing services to maintain privacy and coordinate the FL process. In a
different study, Nguyen et al. [39] introduced an FL-based IDS in which a distinct
detection model is developed for each IoT device, with security gateways building
local models using unlabeled crowd-sourced traffic.

Although the FL-based IDS training paradigm preserves privacy, enables knowledge
sharing, and supports efficient cyber threat detection even with limited and
dispersed IIoT data sources, specific challenges have emerged with its adoption.
These include communication costs in extensive distribution, selecting suitable
equipment for federation, uneven distribution of data and resources, ensuring the
security of FL audit gateways, and addressing issues related to adversarial
attacks [1].

Algorithm 1: Federated Learning-based Intrusion Detection [37].

Server (K: number of selected clients, C: total clients, R: total rounds)
    Initialize model_1
    for t = 1, ..., R do
        S_t ← randomly select K clients from C
        parallel for k ∈ S_t do
            model_{t+1}^k ← Client(model_t, k)
        end
        model_{t+1} ← (1/K) Σ_{k ∈ S_t} model_{t+1}^k
    end

Client (i.e., device) (m: model, k: client ID)
    Split the local dataset D into B local data batches
    batches ← Split(D, B)
    for i = 1, ..., E (local epochs) do
        for b ∈ batches do
            m ← m − η ∇f(m, b)
        end
    end
    Send m to the Server
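
The server and client roles of Algorithm 1 can be sketched in plain Python. This is a
minimal illustration under stated assumptions, not the implementation of [37] or of
this thesis: it assumes a linear model with squared-error loss, represents model
weights as plain lists, and averages the selected clients' models uniformly. The
names `client_update`, `server_round`, and `federated_training` are hypothetical.

```python
import random

def client_update(model, data, epochs=1, batch_size=2, lr=0.01):
    """Client step: mini-batch SGD on squared error for a linear model."""
    w = list(model)  # copy so the global model is not modified in place
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    for _ in range(epochs):
        for batch in batches:
            grad = [0.0] * len(w)
            for x, y in batch:
                err = sum(wi * xi for wi, xi in zip(w, x)) - y
                for j, xj in enumerate(x):
                    grad[j] += 2 * err * xj / len(batch)
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w  # only the weights are returned, never the local data

def server_round(global_model, client_data, k, rng):
    """One round: sample k clients, train locally, average the returned models."""
    selected = rng.sample(sorted(client_data), k)
    updates = [client_update(global_model, client_data[c]) for c in selected]
    return [sum(ws) / k for ws in zip(*updates)]

def federated_training(client_data, dim, rounds=50, k=2, seed=0):
    """Server loop: initialize the model, then run the given number of rounds."""
    rng = random.Random(seed)
    model = [0.0] * dim
    for _ in range(rounds):
        model = server_round(model, client_data, k, rng)
    return model
```

Here `client_data` maps a client identifier to a list of `(features, label)` pairs
that never leave `client_update`. Uniform averaging is the simplest aggregation
rule; weighting each client by its number of samples is a common alternative when
local datasets differ in size.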


2.3.2 Blockchain-based IDS


Adopting blockchain technology can make intrusion detection systems in IoT more
secure. It offers a safe and decentralized way to store and share intrusion detection
data, helping to quickly identify and prevent manipulated data and injection attacks
[40]. It brings several key features. Firstly, it establishes a decentralized network
where data is stored across multiple nodes, eliminating central control and enhancing
security against attacks. Secondly, information recorded on a blockchain is
immutable, preventing alterations and safeguarding IDS data from manipulation.
Thirdly, transparency is fostered, allowing all network nodes to access the same data
and facilitating easier attack detection [41]. Additionally, using smart contracts
automates the process of IDS, minimizing errors [42]. Furthermore, device
authentication ensures that only authorized IoT devices access the network, reducing
attack risks. Lastly, blockchain promotes collaboration among nodes, enabling shared
data and cooperative defense strategies, thereby improving threat identification and
response capabilities [43].
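
The immutability property described above can be illustrated with a hash-linked log
of IDS alerts. The Python sketch below is deliberately not a blockchain: it omits
consensus, networking, smart contracts, and device authentication, and the block
fields and function names are assumptions made for illustration. It only shows why
altering a stored alert becomes detectable once each record commits to the hash of
its predecessor.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first block's back-link

def block_hash(index, prev_hash, alert):
    """SHA-256 over a canonical JSON encoding of the block contents."""
    payload = json.dumps(
        {"index": index, "prev": prev_hash, "alert": alert}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_alert(chain, alert):
    """Append an alert, linking it to the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "prev": prev_hash,
        "alert": alert,
        "hash": block_hash(index, prev_hash, alert),
    })

def verify(chain):
    """Return True only if every block's hash and back-link are intact."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["prev"], block["alert"]):
            return False
        prev_hash = block["hash"]
    return True
```

If a stored alert is edited after the fact, its recomputed hash no longer matches the
recorded one and `verify` returns `False`. In a real deployment, the replicated
ledger and the consensus protocol are what prevent an attacker from simply
recomputing all subsequent hashes.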


By combining blockchain and federated learning, a hybrid approach offers a secure and
effective solution for adaptive IDS in IIoT. For instance, Kumar et al. [44] proposed
an intelligent blockchain framework that integrates smart contracts for data
authentication and FL-based IDS to mitigate data poisoning attacks. Similarly, Wang
et al. [45] designed a blockchain-enabled decentralized FL to alleviate data
falsification issues and reduce communication costs between cloud and edge devices.
The PEFL framework uses two-level privacy-preserving modules: perturbation-based
privacy and DL-based intrusion detection.
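
To give a concrete sense of what a perturbation step can look like, the sketch below
clips a client's model update and adds Gaussian noise before it is shared. This is a
generic illustration, assuming the common clip-and-noise pattern used in
differentially private learning; the text does not specify the mechanism used by the
frameworks cited above, and the function names and default parameters are
hypothetical. No formal privacy guarantee is claimed for these parameter values.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale the update so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def perturb_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip the update, then add zero-mean Gaussian noise to each coordinate."""
    rng = rng or random.Random()
    clipped = clip_update(update, clip_norm)
    return [u + rng.gauss(0.0, noise_std) for u in clipped]
```

Clipping bounds the influence of any single client, and the noise scale controls the
trade-off between privacy and accuracy: a larger `noise_std` hides more about the
local data but degrades the aggregated model.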

Although this hybrid strategy establishes a decentralized network, ensuring privacy
protection and secure FL automation, the integration of blockchain with different
aspects and settings of federated learning, particularly in resource-constrained
IIoT environments, remains a challenge.


2.3.3 Comprehensive Analysis Framework for Privacy-Preserving IDS 


The interconnection of IoT-enabled industrial infrastructure using Cloud and Edge
paradigms and ML and big data analytics illustrates how a security framework can be
deployed to ensure secure data transmission and maintain privacy between Industry
5.0 components.

Several privacy-preserving and secure frameworks have recently been proposed for
various Industry 5.0 applications. Table 2.2 provides an overview of these proposals
to advance privacy-preserving IDS.

Main Idea | Challenges | Domain | Pros | Cons | Ref.
A combination of secure aggregation and differential privacy techniques ensures the protection of data privacy while preserving data utility | Sophisticated privacy threats | DL application | Enhanced privacy protection and data utility | Additional computational overhead and decrease in accuracy | [46]
A blockchain scheme that uses smart contracts to securely aggregate participants' local model updates | Privacy-preserving aggregation of model updates, contribution evaluation and reward mechanisms in FL | DL application | Secure communication and secure multi-party computation | Additional computational overhead; not suitable for all FL scenarios | [47]
Federated GAN training involves the integration of a least squares loss function to mitigate mode collapse issues | High-quality and diversified data augmentation | Renewable energy | Produces realistic and diverse data while preserving the privacy of the data | Training stability and convergence issues, insecure communication | [48]
Protecting the confidentiality of sensitive data in active learning by using FL with homomorphic encryption | Protecting data privacy while preserving data utility | Active learning application | Securely preventing gradient leakage during FL while preserving model accuracy | Computational overhead, scalability concerns, and privacy and data utility trade-offs | [49]
GAN training within the FL framework | Non-IID clients | Computer vision | Improved performance of FL | Training stability and convergence issues, insecure communication | [50]
Blockchain and FL integration involving clients uploading model updates, workers creating valid blocks, and a trusted committee verifying the aggregated model through an evolving verification contract over training iterations | Ensuring the integrity and security of FL | Computer vision | Enhanced security and trustworthiness of FL | Increased computational complexity, potential security threats against the blockchain network | [51]
Generating synthetic data using FL and GAN while ensuring differential privacy | Data storage and improved privacy protection | Computer vision | Efficient GAN training and enhanced privacy preservation | Assumes that the data is i.i.d., but this assumption may not hold in real-world situations | [52]
A decentralized approach using blockchain to store and share data among the Edge nodes and local FL training on these data | Maintaining the accuracy and security of healthcare data | Healthcare | Secure and efficient data sharing | Computational overhead, potential sophisticated privacy threats against the blockchain network | [53]
Blockchain and FL to improve the accuracy and precision of data mining while ensuring information privacy and security | Centralized server, security and privacy threats | Railway industry | Incentive mechanism for participating devices, reliability, and system robustness | Potential sophisticated privacy threats against the blockchain network | [54]

TABLE 2.2: Overview of Privacy-Preserving and Secure Frameworks in Other Domains.

[Figure 2.4: taxonomy tree rooted at "Privacy-preserving IDS" with three branches.
Blockchain-based: Liu et al. 2022 [72]; Rathee et al. 2022 [41]; Peng et al. 2021 [51]; Qiu et al. 2020 [71]; Liang et al. 2020 [70]; Bravo et al. 2019 [69].
Federated Learning-based: Tabasum et al. 2022 [68]; Li et al. 2022 [67]; Friha et al. 2022 [66]; Attota et al. 2021 [65]; Kumar et al. 2021 [64]; Ruzafa et al. 2021 [63]; Zhao et al. 2020 [62]; Rahman et al. 2020 [61].
Hybrid Framework: A. Basset et al. 2021 [60]; Kumar et al. 2021 [44]; Liu et al. 2021 [59]; Wan et al. 2021 [45]; Singh et al. 2022 [58]; Wei et al. 2022 [57]; Islam et al. 2022 [56]; Lakhan et al. 2022 [55].]

FIGURE 2.4: Privacy preserving IDS for Industrial IoT.

Main Idea | Challenges | Privacy Technique | Pros | Cons | Dataset | Ref.
FDL using federated averaging aggregation after several communication rounds for botnet attack detection | Classification performance and communication efficiency | FL | Outperforms localized and distributed DL methods in memory and communication efficiency | Computation overhead (long training time) | Bot-IoT, N-BaIoT | [73]
An ensemble multi-view FL approach that trains on multiple views of IoT network data, using a combination of local and global models | Data heterogeneity, resource constraints | FL | Improved accuracy, privacy-preserving distributed learning | Insecure communication and potential for model poisoning attacks | MQTT | [65]
Distributing a GAN network across IoT devices to function as a classifier and training it using locally augmented data | Data heterogeneity, non-IID data, and privacy concerns | FL + GAN | Improved model convergence and accuracy | Data reliability, insecure communication, and communication overhead | KDD99, NSL_KDD, UNSW-NB15 | [68]
Robust FL using a GAN approach to monitor the global model aggregation for detecting Android malware applications in IIoT | Dynamic poisoning attacks against FL, integrity and reliability of the FL | FL + GAN | Improved accuracy and data privacy | Data reliability, insecure communication, and communication overhead | Drebin, Genome, Contagio (Android malware) | [74]
FDL by deploying a deep privacy-encoding mechanism that perturbs the data before sending it to the server | Privacy threats and data heterogeneity | FL + Data Perturbation | Enhanced privacy | Balancing between data utility and privacy | ToN-IoT | [64]
A new type of poisoning attack manipulates the global model using GAN and assesses its effectiveness against existing defense mechanisms | Poisoning attacks against FL | FL | Novel poisoning attacks that can be manipulated | Difficult to implement in practice | N/A | [75]
Homomorphic encryption encrypts the IDS alerts and performs clustering on the encrypted data without revealing the original data | Inspecting the content of encrypted traffic, computational overhead | Homomorphic encryption | Performs clustering on encrypted data | Not suitable for real-time detection | N/A | [76]
Deploying blockchain to enable secure and decentralized data sharing and management in smart transportation systems | Data privacy, reliability, and security in a decentralized system; communication and computation limitations | Blockchain + FL | Improved scalability and flexibility | Complexity of the system, potential vulnerabilities within blockchain participants | ToN-IoT | [60]
Combining FL and fraud-enabled blockchain, providing data provenance and permission control of the participants to enhance the security and privacy of parameters in FL | Security and scalability of the FL system | FL + Blockchain | Provides data provenance and permission control of FL participants | Requires significant computational resources and may not be feasible for small-scale healthcare organizations | N/A | [55]
A decentralized and differentially private FL-based IDS | Single point of failure, privacy threats | FL + Differential privacy | Secure communication and improved privacy | Decreased accuracy with higher privacy regimes | Edge-IIoTset | [66]

TABLE 2.3: Privacy-Preserving IDS Overview.

This overview offers numerous advantages, including transferring privacy-preserving
techniques, insights into distinct privacy concerns across domains, scalability solu- 
tions, enhanced robustness and security measures, and optimizing resources. By 
leveraging knowledge and techniques from diverse domains, we can develop more 
effective and comprehensive privacy-preserving IDS systems capable of addressing 
the specific privacy challenges inherent to intrusion detection. 

Figure 2.4 illustrates a taxonomy of recent advancements in privacy-preserving 
IDS frameworks, categorized into three groups: blockchain-based IDS, federated learning- 
based IDS, and hybrid approaches. This taxonomy highlights the growing research 
on balancing threat detection with privacy protection. 

Table 2.3 introduces a structured framework to explore privacy-preserving IDS within
the IIoT context. It aims to provide a concise yet comprehensive overview of critical
aspects of these systems, enabling scholars and researchers to systematically assess
and compare various IDS implementations. The selected columns cover essential
dimensions of privacy-preserving IDS: core concepts, challenges addressed, privacy
strategies, and pros and cons.


2.4 Performance Analysis of Intrusion Detection System 
in Industrial IoT 


Researchers commonly evaluate their proposed IDS solutions through validation
strategies such as Hypothetical, Empirical, Simulation, or Theoretical methods, as
detailed in [77]. A validation strategy ensures that the proposed IDS scheme suits
its intended purpose and meets all requirements. This evaluation assesses whether the
IDS detection strategy performs well according to predetermined objectives. The
assessed works performed evaluation using Empirical and Simulation validation
methods. In Empirical evaluation, real-world Industrial IoT (IIoT) data is used,
while Simulation evaluation utilizes real IIoT network traces. Through the literature
review, we synthesize the key performance metrics that illustrate the overall
efficiency and effectiveness of IDS for IIoT. These metrics include [1]:


• Accuracy: This metric pertains to correctly identifying attacks and minimizing
false alarms. Precision, Recall, True Positive Rate, False Positive Rate, and True
Negative Rate are various Accuracy components.

• Complexity: This evaluates the resource expenditure (time, memory, energy,
bandwidth) during IDS operations, including model learning and audit event
processing. It measures real-time detection capabilities and the feasibility of
implementation on resource-constrained devices. Complexity metrics are often
omitted in proposed IoT IDS solutions, hindering proper effectiveness and
real-time potential assessment.

• Completeness: This indicator assesses the ability of an IDS to reliably and
effectively detect known and unknown threats. In the context of IIoT, completeness
is measured by an IDS’s applicability to large-scale infrastructures and its
capability to handle diverse data sources.

• Scalability: This indicates an IDS’s ability to maintain detection effectiveness as
the number of different behaviors grows due to IIoT advancements. Adaptive
and self-learning IDSs autonomously generate and store information or profiles
of previously encountered events, applying them to future detection scenarios.
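
To make the Accuracy components listed above concrete, the following minimal Python sketch derives them from the four counts of a binary confusion matrix (attack = positive class). The function name `detection_metrics` and the example counts are illustrative assumptions and are not taken from the surveyed works.

```python
def detection_metrics(tp, fp, tn, fn):
    """Derive common IDS accuracy metrics from binary confusion-matrix counts.

    tp/fn: attacks detected / missed; tn/fp: benign events passed / flagged.
    """
    def ratio(num, den):
        # Guard against empty classes so the sketch never divides by zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called True Positive Rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": ratio(fp, fp + tn),
        "true_negative_rate": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts: 100 attack events and 100 benign events.
metrics = detection_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A high accuracy value alone can hide a poor false positive rate when benign traffic dominates, which is why the individual components are usually reported together.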


2.4.1 Evaluation Datasets for IDS in Industrial IoT Networks 


Large datasets, whether collected online or offline, are indispensable for the rigorous
and credible evaluation of AI-driven IDS. Authentic, real-world data from Industrial
IoT contexts is scarce, primarily because of privacy concerns, and this scarcity has
prompted the research community to provide practically oriented industrial datasets
that accurately capture the complexities of real industrial scenarios. To this end,
specific datasets have been simulated using various testbed setups incorporating
relevant data from Industrial Internet of Things (IIoT) components. These simulations
are conducted to assess ML-oriented IDS within IIoT. Examples of such datasets are
N-BaIoT, SWaT, TON_IoT, and Edge-IIoTset.

However, conventional network traffic IDS datasets, including NSL-KDD, UNSW-NB15,
and CICIDS2017, remain relevant within the IIoT landscape. In this context, these
traditional datasets often serve either to demonstrate data heterogeneity and the
complex behaviors of various cyber-attacks or to demonstrate the efficiency and
effectiveness of particular ML-based detection approaches that can be used in
resource-constrained IoT. Table 2.4 describes the datasets commonly used to validate
IDS-based cybersecurity in IoT.


2.5 Research Gaps 


Drawing on a thorough review of the relevant literature, designing a cost-effective
yet efficient detection methodology, one that accounts for factors such as detection
rate and decision latency, remains an open research issue within the domain of
IDS-based security solutions for Industrial IoT. Table 2.5 lists research gaps related
to IDS deployment in Industrial IoT, grouped by key qualities. These gaps are crucial
for advancing IDS security in IIoT environments. This thesis makes three key
contributions to address the above challenges and open issues. The first contribution,
detailed in Chapter 3 on page 38, presents a cost-effective and efficient IDS approach
to detect malware network attacks targeting industrial Android systems. We have
addressed computation efficiency and data privacy in this context by leveraging the FL
training framework.


The second contribution, detailed in Chapter 4 on page 50, presents a novel
privacy-preserving secure framework that incorporates blockchain-enabled federated
learning. Within this context, we have addressed the detection of large-scale cyber-attacks

Dataset* | Year | Description | Limitations
NSL-KDD [78] | 2009 | An improved version of KDD 99. Used to evaluate detection efficiency against huge network data but not specific to IIoT | Lacks real-world IoT traffic
UNSW-NB15 [79] | 2015 | Benchmark dataset. Provided by the Australian cybersecurity lab, contains real-world normal and attack traffic scenarios for NIDS evaluation | Lacks real-world IoT traffic
SWaT [80] | 2016 | Collected from a water treatment testbed, contains time series traffic data | Small size, limited scope
CICIDS2017 [81] | 2017 | Proposed by the Canadian Institute for Cybersecurity. Contains network traffic flow with the most common attacks | Lacks real-world IoT traffic
TON_IoT [82] | 2020 | Collected from heterogeneous data sources: telemetry datasets of IoT services, Windows and Linux operating systems, network traffic datasets | Limited features representation. Lacks Industrial IoT data.
N-BaIoT [83] | 2018 | Collected from a simulated IoT environment to capture several normal and botnet events | Limited threat model. Lacks Industrial IoT data.
Bot-IoT [84] | 2019 | Comprises legitimate and malicious traffic from IoT devices, including botnets on IoT networks. | Lacks Industrial IoT data.
MQTTset [85] | 2020 | Utilizes MQTT protocol traffic and various attack streams related to IoT devices. | Limited to only MQTT traffic.
X-IIoTID [86] | 2021 | Encompasses connectivity and device-agnostic data in the context of ML/DL-based IDS for both IoT and Industrial IoT. | Convenient for centralized learning
WUSTL-IIOT-2021 [87] | 2021 | Created using legitimate and malicious data generated by various IIoT and industrial devices to mimic an actual industrial application. | Lacks real-world IoT traffic. Limited attack data.

TABLE 2.4: Datasets Used for IDS Cybersecurity Evaluation in IoT/IIoT.

Key quality | Related challenges and open issues
Data Sources | Industrial data are large-scale and heterogeneous, stemming from diverse origins, networking technologies, and communication protocols, which poses significant issues. Addressing imbalanced and non-identically distributed (non-IID) data within this context remains an unexplored research area. Also, handling big data requires expensive processing methods that impact IDS performance. Conversely, some industrial scenarios lack the data volume needed for effective anomaly-based IDS. Lastly, the scarcity of authentic IIoT datasets and suitable testbeds casts doubt on the credibility of proposed IDS frameworks.
Detection Methodology | Employing behavioral analysis for threat detection in the context of IIoT poses a significant challenge. There are numerous cases where normal behavior occurs infrequently, and it is crucial to accurately differentiate between transient faults and potential threats or anomalies. Anomaly-based detection approaches frequently struggle with rising or diminished sensitivity, particularly when confronted with a growing range of diverse behavioral patterns.
System Deployment | The limited resources in IIoT environments constrain the implementation of efficient IDS solutions.
Performance | Inadequately secured IIoT communication protocols make the spectrum of cyber threats less predictable while increasing the prevalence of false positives. Pursuing cost-effective IDS solutions inevitably affects the trade-off between accuracy and real-time detection, a pivotal concern for IIoT security.
Security risks associated with IDS | Source data must be protected during acquisition and subsequent processing by IDS nodes. Privacy-preserving techniques such as differential privacy have adverse effects on detection performance. Adversarial attacks, such as data poisoning and evasion attacks, undermine IDS performance.


TABLE 2.5: Research gaps for IDS Deployment in Industrial IoT.

in resource-constrained and heterogeneous industrial systems without exposing data
to privacy issues. Furthermore, we have investigated differential privacy-enhanced
FL and related security issues by deploying a novel blockchain design scheme.
Empirical validation of our framework employs a novel industrial IoT dataset (Edge-IIoT
dataset) to demonstrate the efficiency and effectiveness of our framework in terms of
detection accuracy, computation overhead, and energy cost. The results demonstrate
that our proposed secure system can efficiently detect and identify IIoT attacks with
high classification performance even when subjected to distinct data distribution
modes (namely, independent and identically distributed (IID) and non-IID).


Our third contribution, detailed in Chapter 5 on page 85, presents a further
investigation into the efficiency and robustness of IDS-based security in the IoT.
Specifically, we proposed a novel Distributed Learning and Deep Generative Model-Based
Intrusion Detection Technique. Within this paradigm, we addressed robust optimization
against zero-day and adversarial attacks and the challenges related to imbalanced
and highly non-IID data. Furthermore, we investigated differential privacy-enhanced
distributed learning against model performance degradation. Our empirical validation
on the same dataset demonstrated improved efficiency and reliability against zero-day
cyber threats.

Together, these contributions support the continued growth and advancement of the
IIoT security landscape. The outcomes of our research have the potential to enhance
the overall efficiency and reliability of IDS-oriented security for critical industrial
systems within the domain of IIoT.


2.6 Chapter Summary 


This chapter aims to convey the fundamental concept of an IDS-oriented security
solution for IoT-enabled critical industrial infrastructure. The discourse unfolds
by investigating the Industrial IoT architecture, its vulnerability to threat models,
and the associated security requirements. Subsequently, an in-depth review of adaptive
IDS implementation within the context of Industrial IoT is conducted.


CHAPTER 3
FEDERATED LEARNING FOR ANDROID MALWARE DETECTION


“If you think technology can solve your security problems, then you don’t
understand the problems and you don’t understand the technology”


— Bruce Schneier 


3.1 Introduction 


Android is a popular open-source operating system that has gained extensive traction
in IIoT deployments due to its ability to enhance convenience and operational
efficiency [88, 89]. However, this widespread adoption has made these systems
attractive targets for cybercriminals, because industrial systems house valuable assets
and store sensitive information essential for the seamless functioning of operational
technology. Consequently, malicious actors increasingly plant malicious apps to exploit
vulnerabilities in Android systems, spread through networks, and conduct devastating
cyber attacks and privacy intrusions over a large network of connected devices.

This chapter introduces a novel, privacy-preserving, cost-effective, and efficient
approach to deep-learning-based malware detection, employing the emerging Feder- 
ated Learning (FL) paradigm and network analysis. Specifically, we propose Feder- 
ated Convolutional Neural Networks (FedCNN) to detect several types of malware 
based on abnormal network behavior. By leveraging FL and network traffic data, this 
methodology addresses computational overhead and privacy considerations, miti- 
gating large-scale and sophisticated malware attacks that could undermine Industry 
5.0’s core principles. 

The remainder of this chapter is organized as follows: Section 3.2 provides an 
overview of malware detection strategies. Section 3.3 details the development and 
experimentation of the proposed FedCNN model for malware detection, including 
data processing, training methodologies, and performance evaluation. Finally, Sec- 
tion 3.4 presents the results and discussion on the proposed model’s efficacy in de- 
tecting Android malware, compared with conventional centralized methods in terms 


of computation cost and privacy protection. 


3.2 Malware Detection Strategies 


[Figure 3.1, panel (a): taxonomy of malware analysis techniques, covering static
analysis (parsing application source code) and dynamic analysis (application behaviour,
network behaviour). Panel (b): taxonomy of malware detection techniques, covering
placement strategy (cloud-based, host-based) and detection approach (machine learning,
deep learning).]

FIGURE 3.1: A Taxonomy of Malware Analysis Techniques and Detection
Strategies.


Several studies have been conducted to detect malware, generally encompassing
two key phases: malware analysis and detection, as demonstrated in Figure 3.1 [90].
The former entails techniques for analysis and processing to facilitate detection. Static
analysis involves scrutinizing malware code without executing it, leveraging reverse
engineering methods. Although these techniques have demonstrated efficacy against
known, established malware, they fall short against novel variants and can be easily
evaded by obfuscation techniques. Dynamic analysis, on the other hand, entails
observing and analyzing the runtime attributes of malware applications during code
execution. This approach assesses behaviors to decipher malware functionality,
including information flow tracking, function call monitoring, and instruction tracing
[91]. Virtual environments and emulators are commonly employed for dynamic analysis
and data collection. Although this methodology effectively identifies unknown
malware, it is time-intensive and demands substantial computational resources.
Dynamic analysis has also been extended to network traffic to identify malware that
executes attacks via network pathways towards remote targets [92]. In such cases,
malware can be detected by analyzing behavioral patterns in network traffic traces.

Figure 3.1(b) depicts malware detection strategies, which comprise the placement
strategy and the detection approach for detecting and identifying malware. The
placement strategy determines whether the system is implemented on a host or in the
cloud, thereby determining its efficiency against complex code variants while
utilizing limited computational resources. Malware detection approaches describe the
methods and algorithms employed to detect and identify malware. However, their
efficiency relies on the availability of extensive and diverse datasets. Data privacy
concerns and data shortages pose significant challenges when deploying cloud-based and
deep-learning-based security solutions.

Several studies on Android malware detection have been proposed and discussed
[92, 93, 94, 95, 96], encompassing a range of ML and DL approaches and utilizing
various malware analysis techniques and corresponding features. However, the
literature has not extensively explored the use of DL for malware detection that
explicitly exploits the predictability of network behavior. Moreover, several
additional constraints have been identified but are not commonly addressed in these
discussions, such as limitations in computing resources, insufficient training data
availability, and privacy concerns [3].


3.3 Model Development and Experiments 


Our approach involves three main steps: selecting and processing relevant network
data, training Federated Convolutional Neural Networks (FedCNN) to detect malware,
and evaluating the results of this approach considering various performance
metrics and settings.


3.3.1 Dataset Selection and Processing


Deep-learning-based malware detection relies significantly on the quantity and qual- 
ity of training data. Increased availability of high-quality data leads to higher ac- 
curacy and improved results. For this study, we opted for the AAGM dataset (An- 
droid Adware and General Malware), renowned for its diverse collection of malware 
samples [97]. This dataset encompasses 1500 benign app samples and 400 malware 
samples categorized into 10 families, comprising 5 adware and 5 general malware 
families. To capture significant network traffic behavior, the authors deployed these 
samples on actual smartphones and executed user-interaction scenarios. The dataset 
includes 471,597 instances of benign behavior and 160,358 instances of malware be- 
havior, accompanied by 80 network traffic features encompassing flow-based, time- 
based, and packet-based attributes. These features were employed to differentiate 
Android malware behavior from benign applications. 

An essential step before training involves exploratory analysis and data
preprocessing on the selected dataset to address various issues. Initially, we
eliminated five null features, namely ‘flow_urg’, ‘furg_cnt’, ‘burg_cnt’, ‘flow_ece’,
and ‘flow_cwr’, recognizing their potential adverse impact on model performance.
Additionally, four nearly null features were removed: ‘bAvgBulkRate’,
‘bAvgBytesPerBulk’, ‘bAvgPacketsPerBulk’, and ‘std_idle’. Subsequently, we pruned
redundant instances and those with missing values. Following this, the data underwent
normalization. The dataset was then divided using the hold-out validation strategy,
allocating 80% for training and 20% for testing. In the Federated Learning
(FDL) context, a noteworthy portion (80%) of the training data was further distributed
to participating clients.
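
The preprocessing steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the helper names `preprocess` and `hold_out_split`, the `label` column name, and the toy records are hypothetical, and min-max scaling is assumed because the text does not say which normalization was applied.

```python
import random

# Null and nearly null features named in the text above.
DROPPED = {"flow_urg", "furg_cnt", "burg_cnt", "flow_ece", "flow_cwr",
           "bAvgBulkRate", "bAvgBytesPerBulk", "bAvgPacketsPerBulk", "std_idle"}


def preprocess(rows, dropped=DROPPED, label="label"):
    """Drop unused features, remove incomplete or duplicate rows, then scale."""
    clean, seen = [], set()
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in dropped}
        if any(v is None for v in kept.values()):
            continue                      # instance with a missing value
        key = tuple(sorted(kept.items()))
        if key in seen:
            continue                      # redundant (duplicate) instance
        seen.add(key)
        clean.append(kept)
    if not clean:
        return clean
    for feat in [k for k in clean[0] if k != label]:
        lo = min(r[feat] for r in clean)
        hi = max(r[feat] for r in clean)
        for r in clean:
            # Assumed min-max scaling to [0, 1]; constant columns map to 0.
            r[feat] = (r[feat] - lo) / (hi - lo) if hi > lo else 0.0
    return clean


def hold_out_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into training and test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy records standing in for network flows (all values are made up).
raw = [{"flow_urg": 0, "duration": float(i), "pkt_count": 2 * i, "label": i % 2}
       for i in range(10)]
raw.append(dict(raw[0]))                                            # a duplicate
raw.append({"flow_urg": 0, "duration": None, "pkt_count": 1, "label": 0})

data = preprocess(raw)
train, test = hold_out_split(data)
print(len(data), len(train), len(test))
```

In a real experiment the scaling statistics would normally be computed on the training partition only and then applied to the test partition, to avoid leaking test information into training.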

Figure 3.2 illustrates the dataset class distribution after the preprocessing step,
utilizing the t-SNE technique [98]. The t-SNE technique is widely used to visualize
high-dimensional data in a lower-dimensional space. It is particularly useful for
understanding the underlying structure and relationships within complex datasets, as it
aims to preserve the pairwise similarity between data points during the dimensionality
reduction process. By applying t-SNE, we can gain insights into how the preprocessing
steps have impacted the dataset’s distribution and separability of classes.

FIGURE 3.2: Exploring the High-Dimensional AAGM2017 Dataset Using
the t-SNE Technique [98].


3.3.2 FedCNN for Malware Detection


Our detection methodology employed Convolutional Neural Networks (CNNs), a
specialized deep learning approach for data processing. A CNN consists of a sequence
of convolutional layers that apply a mathematical operation called convolution, which
allows the network to capture intricate patterns within the data and to extract the
spatial features that are crucial for accurate decision-making by the model.
Additionally, the network includes perceptron layers that process the extracted
features for classification, which suits the detection of large-scale malware attacks.
Table 3.1 lists our CNN model architecture.

Our FedCNN approach harnesses the capabilities of CNNs in conjunction with
the decentralized nature of Federated Learning (FL), enabling collaborative model
training across distributed data sources while ensuring privacy protection. To this
end, we formulate the FL optimization problem as follows:


• Device Sampling: This involves selecting participating devices from a
distributed network, each with its own private local dataset. In this study, these
datasets were derived by sampling from the training set of the main dataset. We
ensured these sampled datasets were independent and identically distributed (IID),
preserving the same feature vector. Typically, the selection process ensures client
diversity and representation, considering factors like data distribution, device
capabilities, and connectivity.


• Local Training: After device sampling, we proceed with local model training
using the selected clients and their corresponding resources, including data and
computational capabilities. The objective is to update the parameters of
individual models to minimize the local loss function associated with their data.
Mathematically, for each client $k$, the local training seeks to find the optimal
model parameters $\theta_k$ that minimize the local loss $L_k(\theta_k)$:

$$\theta_k = \arg\min_{\theta_k} L_k(\theta_k) \tag{3.1}$$

$$= \arg\min_{\theta_k} \; -\sum_{i=1}^{N_k} \left[ y_i \log f(x_i; \theta_k) + (1 - y_i) \log\bigl(1 - f(x_i; \theta_k)\bigr) \right] \tag{3.2}$$

where $x_i$ denotes the input data sample, $y_i$ is the associated true label (0 as
Benign or 1 as Malware), $f(x_i; \theta_k)$ is the model output with parameters
$\theta_k$ for input $x_i$, and $N_k$ signifies the count of data samples within
client $k$'s local dataset.

During this phase, clients utilize their local data to update their models,
capturing domain-specific patterns and information.


• Model Aggregation: The central server initiates the model aggregation step
following the local training phase. The objective is to combine the knowledge from
individual clients’ models to create a global model that benefits from collective
intelligence while preserving data privacy. This involves weighted averaging of
the model parameters from the selected clients. The aggregation process can be
mathematically expressed as:

$$\theta^{global} = \sum_{k=1}^{K} \frac{N_k}{N} \, \theta_k$$

where the overall loss across all clients’ datasets can be expressed as:

$$\min_{\theta} \sum_{k=1}^{K} \frac{N_k}{N} \, L_k(\theta)$$

Here, $K$ is the number of participating clients, $N_k$ represents the size of the
dataset at client $k$, and $N$ is the total number of samples across all clients. The
resulting global model, $\theta^{global}$, represents a consensus reached by
aggregating the insights from diverse sources.
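
As a small worked example of the two formulas above, the sketch below evaluates the local binary cross-entropy objective for one client and the size-weighted average used for aggregation. The function names `local_loss` and `aggregate`, the clipping constant `eps`, and all numbers are illustrative assumptions; the parameters are plain Python lists rather than CNN weight tensors.

```python
import math


def local_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy summed over one client's samples.

    labels: true labels y_i (0 = Benign, 1 = Malware)
    probs:  model outputs f(x_i; theta_k), read as P(Malware)
    eps:    clipping constant (an assumption) that avoids log(0)
    """
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, probs)
    )


def aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors with weights N_k / N."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n / total * theta[j] for theta, n in zip(client_params, client_sizes))
        for j in range(dim)
    ]


# One client with two samples: a malware flow scored 0.9, a benign flow scored 0.2.
loss = local_loss([1, 0], [0.9, 0.2])

# Two clients holding 100 and 300 samples, so the weights are 0.25 and 0.75.
theta_global = aggregate([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(round(loss, 4), theta_global)
```

Because the weights are proportional to the local dataset sizes, a client holding three times as much data pulls the global parameters three times as strongly.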
Figure 3.3 illustrates the flowchart of our FDL-based Android malware
detection method, which comprises the following steps:


1. The server takes the lead by initializing the global model’s architecture along
with essential global parameters, including the learning rate, the local training
epochs, and the local batch size.

2. The server transmits this comprehensive information to pre-selected clients. These
clients are chosen based on their resource availability and the presence of
sufficient training data. The interaction between the server and the clients operates
asynchronously.

3. Each client independently engages in multiple local training epochs using the
provided model. Subsequently, the client computes updates specific to its dataset
and training progress. These computed updates, representing the new model
parameters, are then returned to the server.

4. Having gathered these updates, the server proceeds to update the global model.
Once this update is completed, the cycle of steps 2, 3, and 4 repeats; through this
iterative process, the global model converges to an optimal state.

5. The server evaluates and maintains the final version of the global model for
future deployment in malware detection. Based on its individual performance, each
participating client independently preserves any relevant global model states
throughout the FedCNN training process.
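
The five steps can be illustrated with a toy simulation. The sketch below is a simplification under several assumptions: a two-parameter logistic model stands in for the CNN, the clients are processed one after another in a synchronous loop (the text describes asynchronous interaction), and the synthetic data, function names, and hyperparameter values are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_train(params, data, lr, epochs):
    """Step 3: a client refines the received model on its private data (SGD)."""
    w, b = params
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(w * x + b) - y   # gradient of the log-loss w.r.t. the logit
            w -= lr * err * x
            b -= lr * err
    return [w, b]


def accuracy(params, data):
    w, b = params
    return sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in data) / len(data)


def run_federated(clients, rounds=10, lr=0.1, epochs=2):
    global_params = [0.0, 0.0]             # Step 1: server initialises the model
    total = sum(len(d) for d in clients)
    for _ in range(rounds):
        # Steps 2-3: send the global model out and collect each client's update.
        updates = [local_train(list(global_params), d, lr, epochs) for d in clients]
        # Step 4: size-weighted average of the returned parameters.
        global_params = [
            sum(len(d) / total * u[j] for u, d in zip(updates, clients))
            for j in range(2)
        ]
    return global_params                   # Step 5: final model kept by the server


# Synthetic one-feature data: label 1 when the feature is positive.
rng = random.Random(0)
samples = [(x, 1 if x > 0 else 0) for x in (rng.uniform(-1, 1) for _ in range(400))]
clients = [samples[i::4] for i in range(4)]   # IID split across four clients

model = run_federated(clients)
print(model, accuracy(model, samples))
```

Only model parameters cross the client boundary in this loop; the raw samples never leave the list owned by each client, which is the property the federated setting relies on for privacy.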


3.4 Results and Discussion 


The efficacy of the presented FedCNN approach for Android malware detection was
systematically evaluated within the controlled environment of Google Colaboratory,
using the PyTorch library and a GPU hardware accelerator for enhanced computational
efficiency. A detailed overview of our experimental setup is outlined in Table 3.1.

FIGURE 3.3: Flowchart: FDL-Based Detection of Android Malware.

To validate the performance of the proposed FedCNN approach, a comparative 
analysis was conducted against a centralized alternative. This alternative employed 
the same CNN model architecture and training configurations. A series of experi- 
ments were thoroughly conducted, during which hyper-parameters were fine-tuned 
to ensure a detection model characterized by precision and generalization. 

In the presented comparative analysis (Table 3.2), we examine the performance of
our proposed FedCNN method for Android malware detection in contrast to other
relevant works. The assessment is carried out on the AAGM2017 dataset, employing
key evaluation metrics such as Accuracy (Acc), Precision (Pr), Recall, F1-score, and
Support. Notably, the evaluation settings in the related works differed markedly,
encompassing distinct validation strategies and variations in the distribution of
training and test samples (Support). Considering the centralized CNN model, the
results demonstrate varying levels of performance. In this regard, the centralized CNN
achieved an accuracy of 84% for malware detection, with precision and recall values


Subject | Parameters | Values
CNN() | Conv1d-1 | [1, 64, 70]
 | Conv1d-2 | [1, 32, 70]
 | Conv1d-3 | [1, 16, 70]
 | Linear-4 | [1, 32]
 | Linear-5 | [1, 2]
 | Learning rate η | 0.001
 | Loss function | CrossEntropyLoss
 | Activation function | ReLU
 | Batch size | 126
 | Classification function | SoftMax
FedL() | Clients sets | [10, 20, 40]
 | Data distribution | IID
 | Local epochs | [2, 3]
 | Total rounds | 30
 | Local batch size | 32
TABLE 3.1: Experimental Settings for FedCNN.
Reference | Classes | Acc | Pr | Recall | F1-score | Support
Lashkari et al. 2018 [97] | Benign + Mal | 0.91 | 0.91 | N/A | N/A | N/A
— Benign N/a | 0.95 8000 
andresini et al. 2021 [99] Walware 0.89 0.66 0.71 3000 
Benign N/A 
achanyecet al 20721001: —valware= | NVA) 097 1096.1 097 1915 
Centralized CNN | Benign | 0.84 | 0.87 | 0.89 | 0.88 | 41877
Centralized CNN | Malware | 0.84 | 0.78 | 0.76 | 0.77 | 22408
Proposed FedCNN approach | Benign | 0.837 | 0.85 | 0.91 | 0.88 | 41877
Proposed FedCNN approach | Malware | 0.837 | 0.80 | 0.71 | 0.75 | 22408
Acc: Accuracy, Pr: Precision, Support: Number of test instances.
TABLE 3.2: Performance Comparison Between Our Proposed Detection
Method (FedCNN) and Other Related Approaches Using the AAGM2017
Dataset.
Total clients | Round 1: Best client | Round 1: Worst client | Round 1: Global model | Round 10: Best client | Round 10: Worst client | Round 10: Global model
K = 10 | 68.17 | 66.31 | 60.95 | 83.07 | 82.16 | 83.74
K = 20 | 69.01 | 66.45 | 68.27 | 82.14 | 81.74 | 82.27
K = 40 | 65.79 | 63.6 | 65.34 | 78.05 | 76.38 | 78.47
TABLE 3.3: Results of Accuracy Evaluation for the Proposed FedCNN
Approach.


of 78% and 76%, respectively, resulting in an F1-score of 77%. Similarly, for the
proposed FedCNN approach, the accuracy reached 83.7%, with a precision of 80% and

[Figure 3.4 panels: train vs. test accuracy (%) and train vs. test loss over time (sec)
using the centralized approach; train accuracy (%) and train loss over time (sec) using
the FDL approach with 10, 20, and 40 clients.]

FIGURE 3.4: Analyzing Model Accuracy, Loss, and Time Complexity
Across Various Training Approaches.

FIGURE 3.5: Confusion Matrix: Insights and Outcomes.


a recall of 71%, yielding an F1-score of 75%. Furthermore, FedCNN effectively
classified instances of the "Benign" class, corresponding to normal applications, with
a recall rate of 92%. In contrast, the "Malware" class, including all 10 Android
malware families, achieved a detection rate of 71%.
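
As a quick consistency check on the figures quoted above, the short sketch below recomputes the Malware-class F1-scores from the precision and recall values in Table 3.2, and approximates the overall accuracy from the per-class recall and support. The helper names are hypothetical, and because the table values are rounded to two decimals, the recomputed accuracy is only expected to agree approximately with the reported one.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def overall_accuracy(per_class):
    """per_class: (recall, support) pairs; correct predictions = recall * support."""
    correct = sum(recall * support for recall, support in per_class)
    return correct / sum(support for _, support in per_class)


# Malware rows of Table 3.2 (precision, recall).
fed_f1 = f1_score(0.80, 0.71)        # proposed FedCNN
cen_f1 = f1_score(0.78, 0.76)        # centralized CNN

# (recall, support) for the Benign and Malware rows of Table 3.2.
fed_acc = overall_accuracy([(0.91, 41877), (0.71, 22408)])
cen_acc = overall_accuracy([(0.89, 41877), (0.76, 22408)])

print(f"FedCNN:      F1 = {fed_f1:.2f}, accuracy ~ {fed_acc:.3f}")
print(f"Centralized: F1 = {cen_f1:.2f}, accuracy ~ {cen_acc:.3f}")
```

The recomputed F1-scores round to the tabulated 0.75 and 0.77, and both accuracy estimates fall within one percentage point of the reported values.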


The results highlight the efficiency of the FedCNN approach, as it achieves a
performance level nearly on par with the centralized approach. Nonetheless, it is worth
noting that the outcomes of both detection methodologies fall short of meeting the
requirements for real-world application. This is primarily attributed to the elevated
incidence of false positives and false negatives, as depicted in Figure 3.5.

Figure 3.4 compares model accuracy, loss, and time complexity across distinct
training approaches. The time complexity analysis highlights the effectiveness of
the proposed FedCNN approach. However, as the number of participating clients
increased from 10 to 40, the global model’s accuracy at round 10 declined from 83.74%
to 78.47%, as illustrated in Table 3.3.


3.5 Chapter Summary 


This chapter examines the effectiveness of an FL paradigm and network behavior 
analysis for malware detection, focusing on privacy preservation, computation cost, 
and detection efficiency. The analysis employed network layer features of malware 
samples to identify variations from their normal behavior. Experimental results demon- 
strated the efficiency and effectiveness of the proposed FL using a CNN approach 
compared with conventional centralized methods in terms of computation cost and 
privacy protection. However, the detection efficiency was inadequate when consid- 
ering only network-based statistical features. Additionally, this analysis is confined 
to sets of malware that require network connectivity and exhibit abnormal network 
behavior. 

Future research in malware detection could incorporate diverse sources of mal- 
ware behavior data and ensemble learning to enhance detection capacity. In contrast, 
our primary focus centers on evaluating our proposed federated learning-based IDS 
methodology using recent Industrial IoT datasets while addressing other significant 
challenges and the associated security concerns, as discussed in Section 2.5. 

Therefore, the next chapter will refine this approach by introducing an improved
privacy-preserving FL, incorporating non-identically distributed (non-IID) data, and
establishing a secure IDS framework to counter the associated security risks.


CHAPTER 4
PRIVACY-PRESERVING SECURE SYSTEM FOR INDUSTRIAL IOTS


“It is a capital mistake to theorize before one has data. Insensibly one begins to
twist facts to suit theories, instead of theories to suit facts”
— Sherlock Holmes 


4.1 Introduction 


Drawing on the insights from the preceding chapter, an FL-based IDS offers better
computational efficiency when deploying ML and DL approaches, outperforming
centralized methods without exposing sensitive data. However, FL trains the models
locally and transfers the updates to a centralized server for aggregation.
Consequently, intruders or untrusted participants can compromise the quality of
model updates and data privacy by exploiting inference attacks [101]. Furthermore,
the centralized aggregation point is a significant vulnerability because it is a
single point of failure. This gives rise to new challenges, including establishing
a reliable framework for secure aggregation and validation of uploaded updates,
addressing system unreliability, and safeguarding privacy during the model
uploading process.

To address these challenges, this chapter introduces an innovative and privacy- 
preserving secure framework named PPSS, which leverages the potential of blockchain 
technology by implementing a lightweight consensus protocol to optimize and se- 
cure the process of FL across untrusted participants. The effectiveness of PPSS is 
thoroughly assessed using a recent Industrial cyber security dataset (Edge-IIoT). A 
comprehensive set of key metrics, including detection rate, accuracy, computational 
efficiency, and energy consumption, is employed to evaluate the framework. Further- 
more, this evaluation encompasses both non-IID and IID data distribution modes. 

The remaining sections of this chapter are organized as follows. Section 4.2
provides an overview of the subject matter and the design objectives of PPSS.
Section 4.3 discusses the development and experimental aspects of the PPSS
framework, examining its components and algorithmic insights. It covers component
interaction, blockchain-enabled federated learning, secure communication, key
management, proof of federated deep learning, and blockchain security analysis.
Section 4.3.3 discusses PPSS-enabled cyber threat detection, including dataset
selection, methods, and experimental settings. Section 4.4 analyzes PPSS
performance across various scenarios, including class-specific results, data
distribution, global model accuracy, convergence time, differential privacy
training, energy cost, and blockchain performance. Finally, Section 4.5 summarizes
the chapter and its key findings on the Privacy-Preserving Secure System for
Industrial IoT.


4.2 Design Objectives of PPSS 


The Industrial IoT brings numerous benefits to industries, such as increased
efficiency, predictive maintenance, and real-time monitoring. However, it also
introduces significant security risks and data privacy challenges. Industrial
organizations must adopt appropriate security mechanisms and strategies to
effectively mitigate these potential cyber threats. Collaborating to implement a
security monitoring mechanism such as an IDS benefits industrial organizations by
fostering shared threat intelligence, reducing costs, improving incident response,
and collectively enhancing the security posture of the industry as a whole. In this
context, cross-silo FL has emerged as a promising approach to address the unique
challenges posed by the IIoT, including resource constraints and data privacy
issues. However, the potential for malicious activity introduces concerns of model
poisoning, privacy breaches, and intellectual property theft, while unintended
privacy leakage can occur during aggregation. Secure aggregation protocols, model
verification mechanisms, and trust-based systems must be employed to address these
issues.


[Figure 4.1 labels: Task Publisher, Delegated investors, Permissioned blockchain,
Learning Chain, Prover, Validators; first-stage local aggregation and second-stage
aggregation.]

FIGURE 4.1: PPSS: An Overview of Our Proposed Blockchain and Federated
Learning for Industrial IoT.


To address these challenges, we designed a privacy-preserving secure system named
PPSS for collaborative IDS across industrial organizations. Our PPSS security
framework employs a permissioned blockchain as a trust framework that verifies the
identities of participating organizations to secure FL and multi-party computation.
Leveraging the blockchain’s distributed architecture, data is encrypted and
transmitted over authenticated and private peer-to-peer (P2P) channels, allowing
each organization to retain control over its data. This prevents unauthorized
access, even if communication channels are compromised. Furthermore, the
blockchain’s design is customized to facilitate model sharing among participants,
using cryptocurrency to reward and host qualified models and to encourage
participant involvement and engagement in this collaborative environment.

PPSS incorporates two distinct federated stages for model aggregation to foster 
cross-silo FL-based IDS. The initial stage involves aggregating models across devices 
within an organization. Subsequently, the second stage occurs between participants, 
facilitated by the blockchain’s utilization of model-containing blocks named the Learning- 
Chain. This enables the secure exchange of threat intelligence, enhancing the collective 
ability to detect and respond to emerging cyber threats. 
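
The two federated stages described above can be sketched in a few lines of Python.
This is a minimal illustration only: the function names are hypothetical, models
are represented as flat lists of floats, and every model is weighted equally, which
is an assumption (FedAvg is often weighted by local sample count).

```python
from typing import Dict, List

Weights = List[float]


def fed_avg(models: List[Weights]) -> Weights:
    """Element-wise mean of equally weighted model vectors (plain FedAvg)."""
    if not models:
        raise ValueError("need at least one model")
    n = len(models)
    return [sum(values) / n for values in zip(*models)]


def two_stage_aggregate(org_updates: Dict[str, List[Weights]]) -> Weights:
    """Stage 1: average device models inside each organization.
    Stage 2: average the resulting per-organization models."""
    org_models = [fed_avg(device_models) for device_models in org_updates.values()]
    return fed_avg(org_models)


# Hypothetical example: two organizations with two and one device(s).
updates = {
    "org_a": [[1.0, 2.0], [3.0, 4.0]],
    "org_b": [[5.0, 6.0]],
}
global_model = two_stage_aggregate(updates)
```

Note that averaging per organization first gives each organization the same
influence on the global model regardless of how many devices it operates.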

At the core of PPSS, we incorporate a validation process for local training
results, acting as a consensus mechanism within the blockchain. This mechanism,
termed Proof-of-Federated Deep Learning (PoFDL), enhances privacy, reliability, and
transparency. Figure 4.1 shows the blockchain-based learning process of the proposed
PPSS security framework, which enables secure communication and validation of the
model updates. It illustrates how the organizations collaborate and contribute to
the federated learning process while maintaining privacy and security:


1. Initiating Learning Task: Task publishers propose the learning process by
creating a Smart Contract (SC) that defines the learning task (initial model,
rewards, terms and conditions).

2. Investor Applications: Investors interested in participating submit applications
to undertake specific tasks by providing proof of eligibility.

3. Allocation of Terms: Administrators review applications and assign predefined
terms and conditions to eligible investors.

4. Delegated Federated Learning (FL): Selected investors become delegated
participants and initiate FL tasks. Importantly, FL tasks are carried out without
sharing raw data.

5. Prover Supervision: Each investor, acting as a Prover, moderates the FL task for
their respective devices. They ensure the integrity of aggregation, transaction
verification, and block generation processes.

6. Global Update Transmission: Provers share global updates with their devices,
prompting subsequent rounds of federation. The focus here is on continuing the
process, not yet on resolving the Proof-of-Federated Deep Learning (PoFDL).

7. PoFDL Resolution and Block Generation: The PoFDL challenge is resolved once
conditions are met. The Provers create a new block containing the validated
information and broadcast it to Validators, whose role is to validate the block’s
contents and reach a consensus.

8. Learning-Chain Inclusion: Upon consensus, the validated block becomes part of
the Learning-Chain, ensuring a secure and tamper-proof record of the learning
process. Both Provers and Validators are rewarded proportionally for their
contributions. This mechanism encourages secure transfer learning among all
participants.


Note that the task publishers and the Provers may be either intrinsic components of
the blockchain system or external entities. In contrast, the Validators represent
the trusted blockchain maintainers.


4.3 Framework Development and Experiments 


This section presents the design scheme of our proposed PPSS security model.
Figure 4.2 shows the workflow of PPSS and its application to collaborative model
training within Industrial IoT networks. The primary objective of this system is to
facilitate the training of a DL model using a federated approach. This process
involves two distinct stages: local aggregation and global aggregation, referred to
as off-chain and learning-chain aggregation. These stages are overseen by the Prover
and Validator nodes, respectively.


FIGURE 4.2: PPSS Security Model for Industrial IoT Networks: Overview
of Architectural Framework and System Components.


Operating at the edge layer, Figure 4.2.a demonstrates the off-chain FL, which in- 
volves local model training within each participating organization, overseen by au- 
thorized representatives known as Provers (P). The aggregated local model that re- 
sults from this process is then sent back to the clients for more communication rounds 
if the model does not meet the criteria for global aggregation. 

In contrast, the Learning-Chain operates within the fog layer and employs
permissioned blockchain technology to facilitate the sharing of models and updates
among organizations. The blockchain functions as a distributed ledger, recording
model updates in blocks that contain the parameters and additional information such
as the origin organization and timestamp. In this paradigm, Validator entities (V)
play a pivotal role as trusted maintainers of the Learning-Chain, ensuring the
integrity of the in-chain FL process, as shown in Figure 4.2.b. Furthermore, a
consensus mechanism is implemented among V nodes to validate block data, ensuring
coherence across the blockchain’s distributed ledger. Blockchain integration is
discussed in detail in Section 4.3.2.

This approach ensures the security and transparency of model updates, establishing
a trustworthy and auditable account of the federated learning process.


4.3.1 Overview of Component Interaction and Algorithmic Insights 


Notation             Description

$C$                  Gradient norm bound
$I$                  Identity matrix
$D$                  Local dataset
$E$                  Local training epochs
$k$                  Global security parameter
$m$                  Message
$W$                  Model weights
$\sigma$             Digital signature
SC                   Smart Contract
$sh_k$               Ephemeral symmetric key
$P_{sk}$             Prover secret key
$P_{pk}$             Prover public key
$C_{sk}$             Client secret key
$C_{pk}$             Client public key
$\alpha$             Learning rate
$\varphi$            Noise scale
$(\epsilon, \delta)$ Privacy cost
$g(x_i)$             Gradient computed on $x_i$
$G_M$                Global model
Txs                  Transactions building the global model
aggregate(.)         Aggregate models by averaging (FedAvg)
SC.aggregate(.)      Aggregate models using smart contract
Sign(.)              Digital signature function
Verify(.)            Verify digital signature function
Encrypt(.)           Symmetric encryption function
Decrypt(.)           Symmetric decryption function
$T_x$                Transaction
Proof                Performance metric (e.g., Accuracy)
TestSet              Cross-validation dataset


TABLE 4.1: Notation for Algorithm Discussion.
Algorithm 2: Secure Aggregation in PPSS.

 1  Validator Nodes: Function validate_Blocks(Learning-Chain)
 2      Validate submitted new model-containing blocks
 3      Achieve consensus and add model-containing blocks to Learning-Chain
        (refer to Algorithm 3)
 4  Prover Nodes: Function offChain_FL(Learning-Chain, TestSet)
 5      Initialize model $W$ from Learning-Chain
 6      for each round $t = 1$ to $R$ do
 7          $S_t \leftarrow$ random subset of clients $k$
 8          $m \leftarrow$ Encrypt($W$, $sh_k$)
 9          for $k \in S_t$ in parallel do
10              $m_k, \sigma \leftarrow$ Edge_client($m$, Sign($m$, $P_{sk}$))
11              if Verify($m_k$, $\sigma$) then
12                  $W_{t+1}^{k} \leftarrow$ Decrypt($m_k.w$, $sh_k$)
13              end
14          end
15          $W_{t+1} \leftarrow$ aggregate($W_{t+1}^{1}, W_{t+1}^{2}, \ldots, W_{t+1}^{k}$)
16          Proof $\leftarrow$ predict($W_{t+1}$, TestSet)
17          if Proof > Learning-Chain.Proof then
18              $G_M$, Txs $\leftarrow$ SC.aggregate($m^1, m^2, \ldots, m^k$)
19              Submit new_Block($G_M$, Txs)
20          end
21      end
22  Edge-clients: Procedure local_Training($m$, $\sigma$, Params)
        Input: $\alpha$, $E$, $\varphi$, $C$
        Output: Privacy-preserved, signed, and encrypted model update
23      if Verify($m$, $\sigma$) then
24          $W_t \leftarrow$ Decrypt($m.w$, $sh_k$)
25      end
26      for batch of samples $B_t$ in $D$ do
27          Compute gradient
28          for $i \in B_t$ do
29              Compute $g_t(x_i) \leftarrow \nabla_w J(w_t, x_i)$
30              Clip gradient
31              $\bar{g}_t(x_i) \leftarrow g_t(x_i) / \max\left(1, \frac{\|g_t(x_i)\|_2}{C}\right)$
32          end
33          Add noise
34          $\tilde{g}_t \leftarrow \frac{1}{|B_t|} \left( \sum_i \bar{g}_t(x_i) + \mathcal{N}(0, \varphi^2 C^2 I) \right)$
35          Descent
36          $w_{t+1} \leftarrow w_t - \alpha \tilde{g}_t$
37      end
38      $m \leftarrow$ Encrypt($W_{t+1}$, $sh_k$)
39      Send ($m$, Sign($m$, $C_{sk}$)) to corresponding Prover node

Figure 4.2.a illustrates the procedural dynamics of blockchain-enabled
decentralized FL (DFL), showing how the components interact throughout the process.
Table 4.1 provides the notation used in the algorithmic discussion. Algorithm 2
describes how the various entities interact to achieve secure model aggregation
within DFL:


• Validator Nodes (V): As depicted in Function 1 of Algorithm 2, V nodes act as
trusted authorities that ensure the integrity of the blockchain and store new
models in the Learning-Chain. They rely on the PoFDL algorithm (Section 4.3.2.2) to
reach consensus on including new models. Once consensus is achieved, relevant
participants are notified, which initiates the transfer learning process using the
newly incorporated models. Moreover, V nodes can also take on additional
responsibilities as Provers, leveraging their own data and computing resources to
execute outstanding tasks.


• Prover Nodes (P): Function 4 of Algorithm 2 illustrates the role of P nodes in
orchestrating a decentralized FL (DFL) process. P nodes have a dual function:
coordinating local FL updates and integrating the results into the Learning-Chain.
In each round, a P node engages a subset of clients, sends them the encrypted
model, and verifies and decrypts the updates they return. These updates are
aggregated and used for prediction on the TestSet, and the P node checks whether
the resulting Proof exceeds the Proof currently recorded in the Learning-Chain.
If it does, a smart contract aggregates the encrypted updates into a global model,
which is then packaged into a new block for submission.


• Edge-clients: Function 22 of Algorithm 2 illustrates the role of Edge-clients in
contributing to secure, collaborative learning through privacy-preserving local
training (local_Training). The procedure covers input and output specifications,
verification and decryption, gradient computation and clipping, noise addition,
parameter updating, and secure encryption and transmission. Given the input
parameters $\alpha$, $E$, $\varphi$, and $C$, it produces a privacy-preserving,
signed, and encrypted model update. The procedure first verifies the authenticity
of the received encrypted model and decrypts it using the shared key $sh_k$,
yielding the local model $W_t$.


Gradient computation and clipping occur for each batch of training samples,
yielding an individual gradient for each data sample. Gradient clipping scales each
per-sample gradient by a factor determined by the gradient norm bound $C$, so that
its norm does not exceed $C$. To enhance privacy, Gaussian noise is added to the
clipped gradients, drawn from a zero-mean Gaussian distribution with covariance
$\varphi^2 C^2 I$. The gradients are then aggregated by averaging the noisy, clipped
gradients across all samples in the batch.


The process continues with the weight update using the aggregated gradient
$\tilde{g}_t$, that is, $w_{t+1} = w_t - \alpha \tilde{g}_t$. Lastly, the model is
encrypted with the Encrypt function and the shared key $sh_k$, signed with the
client’s secret key $C_{sk}$, and sent to the corresponding P node.


Our proposed PPSS secure system presents a sophisticated framework combining 
FL concepts and cryptographic principles to accomplish privacy-preserved, secure 
model aggregation. By outlining responsibilities, maintaining validation consensus, 
and prioritizing data privacy at every stage, PPSS serves as a promising strategy for 
enhancing collaborative machine learning within decentralized networks while safe- 


guarding the sensitive nature of individual data points. 
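
The clip, noise, and descent steps of the Edge-client procedure can be illustrated
with a small, self-contained Python sketch. It is not the thesis implementation:
per-sample gradients are passed in as plain lists rather than computed from a model,
and the names `clip_gradient`, `dp_sgd_step`, `lr`, and `noise_scale` are
hypothetical stand-ins for the learning rate and noise scale of Table 4.1.

```python
import math
import random
from typing import List, Sequence


def clip_gradient(grad: Sequence[float], c: float) -> List[float]:
    """Scale grad down so that its L2 norm is at most c (no-op if already smaller)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(1.0, norm / c)
    return [g / scale for g in grad]


def dp_sgd_step(
    weights: Sequence[float],
    per_sample_grads: Sequence[Sequence[float]],
    lr: float,            # learning rate
    noise_scale: float,   # noise scale
    c: float,             # gradient norm bound C
    rng: random.Random,
) -> List[float]:
    """One noisy descent step over a batch of per-sample gradients."""
    clipped = [clip_gradient(g, c) for g in per_sample_grads]
    batch_size = len(clipped)
    noisy_mean = []
    for j in range(len(weights)):
        total = sum(g[j] for g in clipped)
        # Per-coordinate Gaussian noise with standard deviation noise_scale * C,
        # i.e. covariance noise_scale^2 * C^2 * I over the whole vector.
        total += rng.gauss(0.0, noise_scale * c)
        noisy_mean.append(total / batch_size)
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.0]]
# With noise_scale = 0 the step is deterministic, which makes it easy to check.
new_weights = dp_sgd_step([1.0, 1.0], grads, lr=0.5, noise_scale=0.0, c=1.0, rng=rng)
```

In this toy run the first gradient has norm 5 and is scaled down to norm 1 before
averaging, so a single outlying sample cannot dominate the update.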
FIGURE 4.3: PPSS Learning-Chain Data Structure: Encapsulation of
Information within Blocks.


4.3.2 Blockchain-enabled Federated Learning 


Figure 4.3 illustrates the core architecture of the PPSS Learning-Chain. Each block con- 
tains essential components such as the Block Index and Block Header (Hash, Times- 
tamp, Proof, Previous Block). The Block Data section houses a Signature with the 
client’s ID and block data, Transactions timestamped and hashed using registered 
client IDs, and the Global Model. Building transactions for the global model are metic- 
ulously time-stamped and linked to registered client IDs, enhancing transparency. A 
cryptographic Signature fortifies data integrity. This data structure ensures secure, 
accountable information storage in the PPSS framework. 
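
As an illustration of this data structure, the following Python sketch models a
block and its hash link to the previous block. The field names follow Figure 4.3,
but the concrete types, the use of SHA-256 over a JSON encoding, and the helper
`is_linked` are assumptions made for this example, and the signature is a
placeholder string rather than a real cryptographic signature.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Transaction:
    client_id: str      # registered client ID
    timestamp: float    # time the update was recorded
    update_hash: str    # digest of the client's model update


@dataclass
class Block:
    index: int
    timestamp: float
    proof: float                      # performance metric, e.g. accuracy
    previous_hash: str
    signature: str                    # placeholder for the Prover's signature
    transactions: List[Transaction]
    global_model: List[float]

    def compute_hash(self) -> str:
        """SHA-256 over a canonical JSON encoding of the block contents."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def is_linked(prev: Block, curr: Block) -> bool:
    """True if curr directly follows prev and references prev's current hash."""
    return curr.index == prev.index + 1 and curr.previous_hash == prev.compute_hash()


genesis = Block(0, 0.0, 0.0, "0" * 64, "sig-genesis", [], [0.0, 0.0])
tx = Transaction("client-1", 1.0, hashlib.sha256(b"update").hexdigest())
block_1 = Block(1, 2.0, 0.91, genesis.compute_hash(), "sig-prover-1", [tx], [0.1, -0.2])
```

Because each block stores the hash of its predecessor, changing any field of an
earlier block (for example its global model) breaks the link check for every block
that follows.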

In the Learning-Chain network model, P nodes possess digital identities and access
to the Learning-Chain for moderating local training and proposing new blocks to
V nodes for inclusion in the Learning-Chain. To accomplish this, P nodes use
confidential smart contracts to securely aggregate updates from their respective
clients. As elaborated in [102], this approach leverages Trusted Execution
Environments (TEEs) to protect the aggregation process against manipulation by
unauthorized nodes. P nodes then generate blocks containing the approved model and
the necessary information resulting from successful smart contract execution. These
blocks are incorporated into the Learning-Chain during the subsequent global
aggregation phase.

During the global aggregation phase overseen by V nodes, blocks containing models,
as provided by P nodes, are incorporated into the Learning-Chain. This inclusion
is accomplished through a consensus mechanism named Proof-of-Federated Deep
Learning (PoFDL, Section 4.3.2.2). Once integrated, any authorized participant with
access to the blockchain can request the latest model stored within the
Learning-Chain. This model can then be used for deployment or further refinement.


4.3.2.1 Secure Communication and Key Management


To establish secure end-to-end communication between PPSS system nodes, we propose
a combination of asymmetric and symmetric encryption methods. AES is used for data
encryption with a shared ephemeral key, while RSA handles authentication using
private-public key pairs. A central entity, the "Trust Authority" (A), manages key
generation, distributing public keys as identities and retaining private keys. In
the second-stage aggregation, A establishes a secure AES key between the P and V
nodes. In the first stage, P nodes secure communication with their corresponding
edge clients using AES-based data encryption. This approach safeguards the
Learning-Chain’s integrity, protecting transmitted models from eavesdropping and
cyber threats.

The establishment of the Learning-Chain framework consists of the following
algorithms:


• PSetup($1^k$): This algorithm takes $1^k$, where $k$ is the security parameter of
the system, and returns the description of bilinear groups
$\mathcal{E} = (p, G_1, G_2, G_T, e)$.


• KeyGen($\mathcal{E}$): This algorithm selects two generators $g_1 \in G_1$ and
$g_2 \in G_2$ with a random scalar $x \leftarrow \mathbb{Z}_p$. It produces a
public/private key pair $(pk_i, sk_i)$ for the party invoking it, where
$pk_i = (g_1, g_2, X, z)$, $X \leftarrow g_2^{x}$, $z \leftarrow e(g_1, g_2)$, and
$sk_i = x$.


• KeyAgGen($\mathcal{E}, \mathcal{A}$): This algorithm selects a sequence of public
keys $\mathcal{A}$ and then produces an aggregate public/private key pair
$(Apk_i, Ask_i)$.


• Sign($sk_i$, $m$): This signature algorithm takes a message
$m \in [WR, TS, AUX]$ and a private key $sk_i$, and produces a signature
$\sigma \leftarrow g_1^{1/(x+m)}$, where $AUX$ includes the training information
(e.g., hyper-parameters), $TS$ is an optional collection of test data samples to
evaluate model updates, and $WR$ is a collection of intermediate model weights
registered during the training process.


• Verif($m$, $pub_i$, $\sigma$): This verification algorithm takes a signature
$\sigma$, a message $m \in [WR, TS, AUX]$, and a public key $pub_i$. The algorithm
tests $e(\sigma, X \cdot g_2^{m}) = z$ and returns true or false.


Definition 1 (Learning-Chain correctness). Let $O$ be a $t$-Learning-Chain scheme
initialized with $\mathcal{E} \leftarrow$ PSetup($1^k$), KeyGen($\mathcal{E}$), and
KeyAgGen($\mathcal{E}, \mathcal{A}$), where $\mathcal{E} = (p, G_1, G_2, G_T, e)$.
Let $(pk_1, sk_1), \ldots, (pk_t, sk_t)$ be a sequence of keys generated via
KeyGen($\mathcal{E}$). Let $(Apk_i, Ask_i)$ be an aggregate public/private key pair
generated via KeyAgGen($\mathcal{E}, \mathcal{A}$). Let $m \in [WR, TS, AUX]$ be a
message, and let $(pk_1, \sigma_1), \ldots, (pk_t, \sigma_t)$ be any sequence of
key/signature pairs, where $\sigma_i \leftarrow g_1^{1/(x+m)}$. The $O$ scheme is
valid if, for every message and sequence, the following criterion is satisfied:


• When the Verif($m$, $pub_i$, $\sigma_i$) algorithm tests
$e(\sigma_i, X \cdot g_2^{m}) = z$, the result is true for all $i$.
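
To see why the verification test succeeds for an honestly generated signature, the
check can be expanded using bilinearity of the pairing $e$. The short derivation
below is a sketch under two assumptions: that the signature has the form
$\sigma = g_1^{1/(x+m)}$ (the form that the test, together with $X = g_2^{x}$ and
$z = e(g_1, g_2)$, calls for) and that $x + m \not\equiv 0 \pmod{p}$ so the
inverse exists.

```latex
\begin{align*}
e\!\left(\sigma,\; X \cdot g_2^{m}\right)
  &= e\!\left(g_1^{1/(x+m)},\; g_2^{x} \cdot g_2^{m}\right) \\
  &= e\!\left(g_1^{1/(x+m)},\; g_2^{x+m}\right) \\
  &= e(g_1, g_2)^{\frac{x+m}{x+m}} \\
  &= e(g_1, g_2) = z .
\end{align*}
```

The third line uses the bilinearity property $e(g_1^{a}, g_2^{b}) = e(g_1, g_2)^{ab}$.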


The integration of the Learning-Chain with FL comprises four distinct phases:
1) Initialization Phase, 2) Agreement Phase, 3) Model-Containing Block Generation
Phase, and 4) PoFDL Consensus Enabling Phase.


• Initialization Phase: The Trust Authority (A) registers the different types of
nodes within the system: V nodes, which are trusted parties capable of validating
and adding new blocks; P nodes, which have limited capabilities and can only query
the Learning-Chain and create new blocks; and Edge-clients owned by P nodes, which
serve as model trainers without querying capabilities. Each entity is provided
with a unique ID for authentication and a security parameter $k$ for key
generation. Additionally, P nodes receive the initial model state $W_0$, learning
parameters, and a data partition for constructing the Proof-of-Federated Deep
Learning (PoFDL) from the Learning-Chain. Key generation operations are then
performed: KeyGen($\mathcal{E}$) $\rightarrow (pk_i, sk_i)$ for V nodes and
Edge-clients, and KeyAgGen($\mathcal{E}, \mathcal{A}$) $\rightarrow (Apk_i, Ask_i)$
for P nodes, using the Edge-clients’ public keys. Note that P nodes risk losing
their credentials and permissions if they fail to submit qualified models
according to the smart contract conditions.


• Agreement Phase: A participant publishes a learning task by providing the
initial information and conditions (i.e., initial model state $W_0$, labeled test
dataset TestSet, and parameters) for P nodes that want to join the federated
learning task. Each P must provide proof of eligibility, such as training
performance using its own resources, and adhere to the smart contract terms in
order to contribute to the Learning-Chain. The selected eligible P nodes must
register all corresponding Edge-clients to ensure authentication.


• Model-Containing Block Generation Phase: A Prover P generates a block
with corresponding Edge-clients’ updates. These updates are verified using the 
tamper-proof ledger of the Learning-Chain as a reference to identify malicious 
clients. Only valid updates are encapsulated as transactions (T,). Given two dif- 
ferent hash functions: H; : © x {0,1} — {0,1} and Hp: {0,1} x {0,1}* > Q. 
Given a secret key x;,y; € Z;, and a block Bloc; € {0,1}*, P picks: gi € Gi, 
a random number 7; € Z;, and computes o; = Hy (ger tear ys € {0,1}. 
Then P computes b; = Hp(0;, Blocj,r;) € O. and sets the time intervals of a block 


generation as $T$. The signature of a block $Bloc_i$ is $(\sigma_i, b_i, r_i)$.
Finally, P broadcasts the transaction data combined with the signature to the
blockchain V nodes.


• PoFDL Consensus Enabling Phase: discussed in Section 4.3.2.2.


4.3.2.2 Proof of Federated Deep Learning for Consensus Establishment


Inspired by the Proof-of-Authority (PoA) consensus mechanism, we develop PoFDL to
complete verification and add new blocks to the Learning-Chain. However, rather
than relying solely on pre-selected reputable nodes, we allow each P to become a
V node in PoFDL by staking a deposit of cryptocurrency or staking its reputation.
This approach enhances trust among participants and strengthens blockchain
immutability. Algorithm 3 describes the PoFDL consensus procedure, whereby a
requisite number of V nodes confirm the validity of added blocks.

Specifically, V nodes maintain the Learning-Chain by adding new blocks. After
a P generates a new block, it submits it to the corresponding mining authority for 
verification. This authority operates as the "Leader" for the subsequent block in the 
Learning-Chain. To equitably distribute the responsibility of block creation among 
validators, PoFDL implements a time-based mining rotation scheme, ensuring the 
selection of a single elected Leader at each time-step, as specified by the smart contract 
[103]. If the current leader fails to transmit a block within the allotted time, they must 
submit an empty block to uphold their reputation. 
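
The time-based rotation can be pictured with a short Python sketch. The source only
states that exactly one Leader is elected per time-step; the round-robin rule used
here (time-step index modulo the number of validators), the fixed step duration,
and the function names are assumptions made for illustration.

```python
from typing import Sequence


def step_at(now: float, genesis_time: float, step_duration: float) -> int:
    """Index of the time-step that contains the timestamp `now`."""
    return int((now - genesis_time) // step_duration)


def leader_for_step(validators: Sequence[str], time_step: int) -> str:
    """Round-robin choice: exactly one leader per time-step."""
    return validators[time_step % len(validators)]


validators = ["V1", "V2", "V3"]
current_step = step_at(now=25.0, genesis_time=0.0, step_duration=10.0)
current_leader = leader_for_step(validators, current_step)
```

Since every validator can compute the same step index from its clock, all honest
nodes agree on who is expected to lead without exchanging extra messages.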

Figure 4.4 illustrates the consensus process and message exchanges for block
proposals based on PoFDL. The Leader broadcasts the received block to the other
validators for block acceptance. Each V evaluates the received model-containing
block, broadcasts the result, and compares it with those of the other validators to
decide on block acceptance. The block validation process and consensus mechanism
are given in Algorithm 3. The block is added to the chain if: (1) the Leader is the
one expected to be the current leader, and (2) at least $N/2 + 1$ Validators
received the same block and confirmed its acceptance.
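
The acceptance rule can be sketched as a per-validator vote followed by a quorum
check. The sketch below assumes that each validator compares the average of its own
measured proof and the proof claimed in the block against the proof of the previous
block, and that $N/2 + 1$ is computed with integer division; the function names and
sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def validator_vote(sender: str, expected_leader: str, signatures_ok: bool,
                   block_proof: float, local_proof: float,
                   previous_proof: float) -> bool:
    """Vote Success only for the expected leader, valid signatures, and a better proof."""
    if sender != expected_leader or not signatures_ok:
        return False
    return mean([local_proof, block_proof]) > previous_proof


def reaches_consensus(votes: Sequence[bool], n_validators: int) -> bool:
    """Require at least N/2 + 1 Success votes (integer division assumed)."""
    return sum(votes) >= n_validators // 2 + 1


# Five validators measure different local proofs for the same proposed block.
votes = [
    validator_vote("P1", "P1", True, block_proof=0.92, local_proof=p, previous_proof=0.88)
    for p in (0.90, 0.91, 0.80, 0.89, 0.70)
]
accepted = reaches_consensus(votes, n_validators=5)
```

Here three of the five validators vote Success, which meets the quorum of three, so
the block would be accepted.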

In contrast to prior proof of learning concepts [104], our proposed PoFDL uses the 
inference phase to validate learned models, ensuring computational efficiency and 
data privacy. This approach involves: 

(i) Each Prover P receives the same TestSet of labeled data to prove new models.

(ii) Each Validator $V_i \in N$ is allocated a distinct partition of labeled data
$D_i \in$ ValidationSet during the initialization phase.


These conditions are constrained by the following relationships:
FIGURE 4.4: Consensus Process and Message Exchanges in PoFDL.


Algorithm 3: Validators: Validate and Add New Block.

Validators: Function ValidateAndAddBlock(Block, previous_hash, D*)
    Input : Block, previous_hash, Dependent Validation data: D*
    Output: Add valid block: Success or Fail
    for Validator $V \in N$ do
        if Sender is current Leader then
            if Authenticated and Valid(Block.signatures) then
                Valid_proof $\leftarrow$ predict(Block.model, D*)
                if Average(Valid_proof, Block.Proof) > Block[previous_hash].Proof then
                    Broadcast m(Block.hash, Success) to Validators
                else
                    Broadcast to all (Block.hash, Fail)
            else
                Broadcast to all (Block.hash, Fail)
    if Each $V \in N$ upon receiving at least $N/2 + 1$ m(Block.hash, Success) then
        Reaches consensus and confirms adding full Block data to Leader
    else
        Sends a penalty to Leader
    if Leader upon receiving at least $N/2 + 1$ Confirm adding (Block.hash) then
        It stores the current block


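$$\text{TestSet} \cap \text{ValidationSet} = \emptyset \tag{4.1}$$
$$\text{ValidationSet} = (D_1, D_2, \ldots, D_N) \tag{4.2}$$
$$\forall (D_i)_{i \in I} \subset \text{ValidationSet} \text{ with } \operatorname{Card} I = \frac{N}{2}: \quad \bigcap_{i \in I} D_i = \emptyset \tag{4.3}$$
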
To maintain integrity, at least $N/2$ validation data partitions are required to be
mutually exclusive. This requirement guards against fraudulent actions by Provers
and Byzantine nodes. As depicted in Figure 4.4, P functions as the current Leader
node and, if qualified, as V in subsequent rounds. The Leader proposes
model-containing blocks to V nodes. After the V nodes evaluate the block and
broadcast their results, the system reaches consensus once at least $N/2 + 1$
V nodes agree. Validated blocks are stored in the Learning-Chain, and both the
Leader and P receive cryptocurrency rewards.
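
The constraints on the test and validation data can be checked mechanically. The
following Python sketch is one possible reading of Equations 4.1 to 4.3, assuming
that sample identifiers are hashable, that each partition is a set, and that the
third condition means every group of $N/2$ partitions has an empty common
intersection; the function name and the example data are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def check_validation_split(test_set: Set[int], partitions: List[Set[int]]) -> bool:
    """Return True if the data split satisfies the assumed constraints."""
    n = len(partitions)
    # (4.1) the Provers' TestSet must not overlap the validators' data.
    validation = set().union(*partitions)
    if test_set & validation:
        return False
    # (4.3) any group of N/2 partitions must have no sample in common
    # (groups of at least two partitions are checked).
    group_size = max(2, n // 2)
    for group in combinations(partitions, group_size):
        if set.intersection(*group):
            return False
    return True


test_ids = {1, 2}
good_split = [{3, 4}, {5, 6}, {7, 8}, {9, 10}]
bad_split = [{3, 4}, {4, 5}, {6}, {7}]   # sample 4 is shared by two validators
ok = check_validation_split(test_ids, good_split)
```

The check enumerates all groups of the given size, which is fine for a small
validator committee but grows combinatorially with the number of partitions.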

This consensus approach safeguards against malicious activities, enhances
efficiency, and ensures the systematic incorporation of validated model updates
into the collaborative learning process.


4.3.2.3 Blockchain Security Analysis 


1. Sybil attack: This attack undermines the decentralized nature of the network by
creating a large number of pseudonymous identities to gain influence. The attacker
can then manipulate consensus mechanisms or execute malicious actions by
leveraging control over these fake identities. To counter such nodes, our PPSS
design only admits Prover nodes with positive reputations earned by contributing
positively to the learning environment. This ensures that only trusted and
reliable participants can take part in the federated learning process.


2. Byzantine attacks: Byzantine nodes intentionally deviate from the established
protocol to disrupt consensus mechanisms and compromise the integrity of the
blockchain. To prevent such attacks, our PPSS design allows trusted Validator
nodes to collaboratively detect malicious nodes during their leadership using a
voting mechanism. A leader node can be voted as malicious and subsequently removed
in the following scenarios: (i) failing to propose any blocks; (ii) exceeding the
expected number of proposed blocks (Denial of Service attacks); or (iii) presenting
different blocks to different authorities.
3. Model inversion and membership inference attacks: To mitigate these attacks,
especially in honest-but-curious scenarios, our PPSS framework incorporates a
differential privacy-enhanced training mechanism. By clipping the parameter
gradients of client models and adding Gaussian noise to them during training
(Algorithm 2), our PPSS framework limits the information about individual data
records that can be inferred from the model parameters. Furthermore, our framework
employs encryption and authentication mechanisms to protect shared models from
public exposure. This further enhances the privacy and security of FL, preventing
unauthorized access to the models and reducing the risk of inference attacks.


4. Model theft attacks: In this scenario, a consensus node steals a trained model that it received from other Provers for validation and then claims ownership by re-broadcasting it to other consensus nodes, as in a replay attack. To prevent these attacks, we impose two security measures. First, we require that a Prover node incorporate updates from intermediate clients when building transactions for the global model within the block data (Figure 4.3). These transactions are timestamped and hashed using the matching registered client IDs on the blockchain. Second, we require that a Prover include in the block data a signature covering its ID and the block data itself. This makes it difficult for an adversary to falsify block data and rebroadcast it. Moreover, this approach reduces the communication overhead of messages exchanged between Prover nodes and Validator nodes during investigations.
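The two model-theft countermeasures can be illustrated with a minimal sketch. It is not the PPSS implementation: the field names, the SHA-256 hash, and the use of an HMAC as a stand-in for the Prover's signature are all assumptions made for illustration (a deployment would more likely use a public-key signature scheme).

```python
import hashlib
import hmac
import json

def make_transaction(client_id: str, update_digest: str, timestamp: float) -> dict:
    """Bind a client update to its registered client ID and a timestamp (hypothetical layout)."""
    body = {"client_id": client_id, "update": update_digest, "timestamp": timestamp}
    payload = json.dumps(body, sort_keys=True).encode()
    body["tx_hash"] = hashlib.sha256(payload).hexdigest()
    return body

def sign_block(prover_id: str, transactions: list, key: bytes) -> dict:
    """Attach a signature that covers both the Prover ID and the block data."""
    block = {"prover_id": prover_id, "transactions": transactions}
    payload = json.dumps(block, sort_keys=True).encode()
    # HMAC is used here only as a simple stand-in for a digital signature.
    block["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return block

def verify_block(block: dict, key: bytes) -> bool:
    """Recompute the signature over everything except the signature field."""
    unsigned = {k: v for k, v in block.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, block["signature"])
```

Because the signature covers the Prover ID, a node that re-broadcasts the same block under its own ID produces a block that no longer verifies.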


By integrating these measures, the PPSS framework fortifies the security and integrity of the model-sharing process, rendering it resilient against various attack scenarios.


4.3.3 PPSS-enabled Cyber Threat Detection


To safeguard industrial networks against large-scale and emergent cyber threats, we propose a two-tiered security approach. Such threats include cloud security weaknesses that expose sensitive data, ransomware attacks that encrypt critical data, DDoS attacks that disrupt operations, IoT device vulnerabilities that lead to unauthorized access, insider threats, and Advanced Persistent Threats (APTs) that maintain stealthy network access and cause privacy breaches. The approach combines the PPSS framework with an anomaly-based, deep learning IDS, bolstering the network's overall security posture.

The PPSS framework not only enhances privacy and security but also facilitates a decentralized deployment of the IDS. In this arrangement, detection nodes receive frequent updates of efficient and reliable detection models. These models are trained across extensive networks at minimal cost, which improves the scalability and efficiency of the IDS.


4.3.3.1 DataSet Selection and Processing: Our study employs the recently proposed EdgeIIoTSet dataset for evaluation [33]. This dataset offers a realistic representation of Industrial IoT environments, a comprehensive feature set, diverse attack scenarios, and suitability for FL-based IDS evaluation.

The following considerations illustrate why this dataset is an appropriate candidate for assessing our proposed PPSS-enabled IDS:


1. Realistic Environment Representation: The dataset is specifically tailored for IoT and IIoT security research. It was created by modeling and emulating actual industrial systems in real-world IIoT environments, which gives it a realistic representation.


2. Comprehensive Feature Set: The dataset encompasses extensive features from diverse sources, including alerts, system resources, logs, and network traffic. These features provide a rich source of information for training and enhancing an FL-based IDS. The dataset contains over 10 million normal records, 9 million malicious records, and 67 features. These records are collected from device and alert logs across a network of seven interconnected layers, which include the cloud/fog computing layer, Blockchain layer, SDN layer, edge layer, and IoT/IIoT perception layer. The dataset also covers a range of related protocols, including industrial protocols like Modbus and MQTT.


3. Variety of Attacks: The dataset encompasses attacks relevant to IIoT connectivity protocols, systematically categorized into five threat categories as depicted in Table 4.2. These threats encompass a wide range of 15 class attack types, comprehensively representing the cybersecurity challenges in IoT and IIoT applications.


Attack Category | Description
Malware Attacks | These attacks involve the installation of backdoors or malicious programs on IoT devices or edge servers. This category covers attacks like Ransomware attacks and Backdoor attacks.
DoS/DDoS Attacks | These attacks are intended to render the victim's IoT edge server inaccessible to legitimate requests. This category encompasses attacks such as TCP SYN Flood DDoS attack, UDP flood DDoS attack, HTTP flood DDoS attack, and ICMP flood DDoS attack.
Information Gathering | These attacks involve the analysis of IoT data packets to identify vulnerabilities in IoT devices and edge servers. This category encompasses attacks like Port Scanning, OS Fingerprinting, and Vulnerability Scanning Attacks.
Man-in-the-Middle | These attacks involve the interception of communications between IoT devices and edge servers. This category includes attacks such as ARP Spoofing attacks and DNS Spoofing attacks.
Injection Attacks | These attacks involve sending malicious scripts to unsuspecting users, allowing the attacker to gain access to sensitive information. This category encompasses Cross-site Scripting (XSS) attacks and SQL Injection.

TABLE 4.2: EdgeIIoTSet: Attack Categories and Descriptions.

These characteristics align with the focus of our research on an FL-based IDS for IIoT and make EdgeIIoTSet a reliable and representative dataset for evaluating IDS within the complex and dynamic settings of IIoT applications.

The processing of the EdgeIIoTSet dataset involves detecting and rectifying corrupt or inaccurate records by eliminating duplicates and records with missing values. Main flow features, such as IP addresses, ports, timestamps, and payload information, are excluded. Furthermore, categorical variables are converted into one-hot encoded feature variables [33].
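The processing steps described above can be sketched in plain Python as follows. This is an illustrative sketch only: the column names in the usage example are hypothetical, and the actual pipeline in [33] may order or implement the steps differently.

```python
def preprocess(records, excluded, categorical):
    """Deduplicate, drop incomplete rows, drop flow identifiers, then one-hot encode.

    records: list of dicts (one per record); excluded: set of column names to drop;
    categorical: list of column names to one-hot encode.
    """
    seen, cleaned = set(), []
    for row in records:
        if any(value is None or value == "" for value in row.values()):
            continue  # drop records with missing values
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append({k: v for k, v in row.items() if k not in excluded})

    # Collect the category levels observed for each categorical column.
    levels = {col: sorted({row[col] for row in cleaned}) for col in categorical}

    encoded = []
    for row in cleaned:
        out = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for level in levels[col]:
                out[f"{col}={level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded
```

For example, calling preprocess(rows, excluded={"ip.src"}, categorical=["proto"]) removes the (hypothetical) source-address column and replaces "proto" with one indicator column per observed protocol.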


4.3.3.2 PPSS Detection Method: Our IDS detection methodology is anomaly-based. This method allows the IDS to continuously monitor and categorize various behaviors, enabling timely identification of potential cyber threats. Moreover, this method has proven effective in identifying unknown attacks, including zero-day attacks.

We leverage this method by employing convolutional neural networks (CNNs) as the foundational detection module [105]. Within the domain of DL, CNNs occupy a significant position as a distinctive model. Figure 4.5 depicts our proposed CNN model adopted within the Privacy-Preserving Secure System (PPSS) framework. The architecture of CNNs comprises interconnected convolutional layers that serve as information extraction modules. These layers employ learnable filters, denoted as parameters W, applied to input data X, resulting in feature maps F through convolution, represented as F = W * X. Subsequently, pooling operations reduce feature map dimensionality. For instance, max pooling is defined by:

$$P_{\max}(F)_{i,j} = \max_{(m,n) \in \mathcal{R}_{i,j}} F_{m,n} \tag{4.4}$$

where $\mathcal{R}_{i,j}$ denotes the pooling window associated with output position (i, j). A convolutional layer followed by pooling P and an activation function $\sigma$ is denoted as:

$$F_{out} = \sigma(P(W * F_{in})) \tag{4.5}$$

With such layers, CNNs effectively identify localized patterns, capturing intricate details often overlooked by conventional neural networks. Additionally, CNNs incorporate fully connected layers for flattened feature maps, making them well-suited to identify several types of cyber attacks.

FIGURE 4.5: Structure of the CNN Model Adopted by the PPSS Framework.
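To make Equations 4.4 and 4.5 concrete, the following minimal sketch applies one convolution, max pooling, and activation step to a small 2D input using plain Python lists. It is illustrative only: the 2D layout, the non-overlapping pooling window, and the choice of ReLU for the activation are assumptions, and the model in Figure 4.5 is implemented with PyTorch rather than with this code.

```python
def conv2d(x, w):
    """Valid 2D cross-correlation of input x with filter w.

    Deep learning libraries commonly implement "convolution" this way.
    """
    kh, kw = len(w), len(w[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(w[a][b] * x[i + a][j + b] for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

def max_pool(f, size):
    """Eq. 4.4 with non-overlapping windows: each output is the max over a size x size region."""
    return [[max(f[m][n] for m in range(i, i + size) for n in range(j, j + size))
             for j in range(0, len(f[0]) - size + 1, size)]
            for i in range(0, len(f) - size + 1, size)]

def relu(f):
    """Assumed activation function (sigma in Eq. 4.5)."""
    return [[max(0.0, v) for v in row] for row in f]

def conv_block(x, w, pool=2):
    """Eq. 4.5: F_out = sigma(P(W * F_in))."""
    return relu(max_pool(conv2d(x, w), pool))
```

A 3 x 3 input with a 2 x 2 filter yields a 2 x 2 feature map, which a pooling window of size 2 reduces to a single value before the activation is applied.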


4.3.3.3 Experimental Settings: The presented PPSS-enabled cyber threat detection was systematically evaluated within the controlled environment of Google Colaboratory, using the PyTorch library and a Tesla-T4 GPU hardware accelerator for enhanced computational efficiency. The experiments were conducted in two distinct aggregation stages.

In the first stage, we instantiated localized Federated Learning (CFL) using multiple edge Provers, often called FL servers, each equipped with dedicated data resources and clients. Several scenarios were explored within this stage, including variations in the number of participating clients per Prover and the integration of differential privacy training via DP-SGD. These explorations assessed the performance of the initial aggregation phase when dealing with limited data resources and privacy preservation concerns.

Subsequently, in the second aggregation stage, we introduced the PPSS-enabled decentralized FL (DFL) mechanism, leveraging the proposed PoFDL consensus mechanism (Section 4.3.2.2). This stage facilitated knowledge transfer among all participating Provers. During this phase, DFL executed a restricted federated aggregation that exclusively incorporated validated models stored within the Learning-Chain. The evaluation assessed the global model's performance under different data distribution characteristics (IID and Non-IID) and explored data augmentation with respect to the number of participating Provers.
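The restricted aggregation of the second stage can be sketched as follows. The sketch assumes a FedAvg-style average weighted by the number of training samples, which the text does not specify; the identifiers and the flat list representation of model parameters are likewise hypothetical.

```python
def aggregate_validated(models, validated_ids):
    """Sample-weighted average of parameter vectors, restricted to validated Prover models.

    models: dict mapping prover_id -> (weights: list of floats, num_samples: int)
    validated_ids: set of prover ids whose models were accepted (e.g. stored on the chain)
    """
    selected = [models[p] for p in models if p in validated_ids]
    if not selected:
        raise ValueError("no validated models to aggregate")
    total = sum(n for _, n in selected)
    dim = len(selected[0][0])
    # Models that failed validation never enter the average.
    return [sum(w[i] * n for w, n in selected) / total for i in range(dim)]
```

In this sketch, a model that was not validated has no influence on the global parameters, regardless of how many samples its Prover claims to have used.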

Within the differential privacy settings, we employed the Opacus library [106] for implementation, introducing a noise multiplier parameterized by (ε, δ) and imposing a maximum gradient norm of C = 1.2. Table 4.3 shows the experimental configurations and learning parameters adopted in this study, while Tables 4.5 and 4.6 summarize the evaluation outcomes for both localized and decentralized FL (CFL, DFL) across diverse settings.
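The core of a DP-SGD update, per-sample gradient clipping to the norm bound C followed by Gaussian noise, can be sketched without any library. This is not Opacus code; the function name and the default noise multiplier are hypothetical, and in practice the noise multiplier is derived from the target (ε, δ).

```python
import math
import random

def dp_sgd_gradient(per_sample_grads, max_grad_norm=1.2, noise_multiplier=1.0, rng=None):
    """Clip each per-sample gradient to max_grad_norm, sum, add Gaussian noise, then average."""
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds the bound C.
        scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]
```

Setting noise_multiplier to 0 isolates the clipping step, which is useful for checking that no single record can move the update by more than C.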

Various metrics were employed to evaluate the efficiency and effectiveness of IDS detection within CFL and DFL and to assess the impact of security constraints on the learning process. These metrics encompassed Accuracy, Precision, Detection Rate, Time Complexity, and Energy Cost.
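The first three metrics can be computed from true and predicted labels as in the sketch below. It assumes the usual definitions, with detection rate taken to be per-class recall; the text does not spell these definitions out, so this is an interpretation.

```python
def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and, for each class, (precision, detection rate) as fractions."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    report = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        detection_rate = tp / (tp + fn) if tp + fn else 0.0  # recall
        report[c] = (precision, detection_rate)
    return accuracy, report
```

A class that is never predicted receives a precision of 0 under this convention, which is one way a row of zeros can arise in a per-class report.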


• Time Complexity: This metric represents the time required for the global model to converge. It encompasses several key factors, including the individual client's training time, the computational overhead of model aggregation, and the time required for consensus inference under PoFDL. Notably, this calculation does not incorporate the computational costs of secure communication and data transmission.


• Energy Cost: The term 'energy cost' pertains to the energy consumed while training the global model. It is quantified by the following expression [107]:

$$\Theta(e, N, R) = \sum_{r=1}^{R} \sum_{i=1}^{N} \mathbb{1}_{\{n_{i,r}\}} \left( t_{n_{i,r}} \, e_{n_{i,r}} \right) \ \text{kWh} \tag{4.6}$$

Here, e signifies the energy consumption across N devices, R corresponds to the total number of federated rounds, $t_{n_{i,r}}$ represents the wall-clock time of device $n_i$ during round r, $e_{n_{i,r}}$ signifies the energy consumption of device $n_i$ during round r, and $\mathbb{1}_{\{n_{i,r}\}}$ serves as an indicator function assessing whether device $n_i$ is chosen for FL training during round r. Note that $n_i$ can denote either a client or a server. The energy cost is expressed in kilowatt-hours (kWh) and is estimated using the Carbontracker library [108].
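A direct evaluation of the energy expression might look like the following sketch. It assumes that time is given in hours and that the per-device energy term is an average power draw in kilowatts, so that their product is in kWh; the input layout is hypothetical, and Carbontracker performs its own measurement rather than using this function.

```python
def energy_cost_kwh(rounds):
    """Sum time x power over all devices selected in each federated round.

    rounds: list (one entry per round r) of lists of (selected, hours, kilowatts)
    tuples, one tuple per device n_i.
    """
    return sum(
        hours * kilowatts
        for devices in rounds
        for selected, hours, kilowatts in devices
        if selected  # indicator function: only devices chosen in this round count
    )
```

For instance, a device drawing 0.07 kW for half an hour in a round where it is selected contributes 0.035 kWh, while an unselected device contributes nothing.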


• Heterogeneity in data distribution: Experiments were conducted on both Independent and Identically Distributed (IID) and Non-Independent and Non-Identically Distributed (Non-IID) data sets to evaluate training performance under diverse data distributions. In the IID scenario, the training dataset was partitioned into independent subsets with identical distributions, allocated to client groups, and overseen by a Prover node. This strategy maintained data homogeneity among clients, ensuring they had comparable datasets. In contrast, a label partitioning approach was implemented in the Non-IID scenario, assigning each client group a randomly selected subset of labels associated with the same feature vectors in the training data. This setup assumed each Prover node had partial knowledge of the entire set of classes within the problem. The Non-IID configuration introduced data heterogeneity among clients, deviating from the uniformity observed in the IID scenario.
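The two partitioning schemes can be sketched as below. This is one possible reading of the description: the function names, the fixed number of labels per group, and the fact that groups may share samples in the Non-IID case are assumptions made for illustration.

```python
import random

def partition_iid(samples, num_groups, seed=0):
    """Shuffle once, then deal samples round-robin so every group sees the same distribution."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[g::num_groups] for g in range(num_groups)]

def partition_by_label(samples, num_groups, labels_per_group, seed=0):
    """Non-IID split: each group only receives samples from a random subset of labels.

    samples: list of (features, label) pairs.
    """
    rng = random.Random(seed)
    labels = sorted({label for _, label in samples})
    groups = []
    for _ in range(num_groups):
        chosen = set(rng.sample(labels, labels_per_group))
        groups.append([s for s in samples if s[1] in chosen])
    return groups
```

With the label-based split, each group (and hence each Prover) has only partial knowledge of the class set, which is the property the Non-IID experiments rely on.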


4.4 Results and Discussion 


Numerous experiments have been systematically executed to assess the efficacy of the proposed PPSS-enabled Decentralized Federated Learning (DFL) framework. These experiments were designed to investigate the influence of security constraints on the FL learning process.

Group                | Parameter               | Values
Federated Learning   | Clients (K)             | [20, 40, 80]
Differential Privacy | Epsilon (ε)             | [0.1, 1, 2]
Differential Privacy | Delta (δ)               | 1.5e-5
Differential Privacy | Gradient Norm Bound (C) | 1.2
                     | Model Architecture      | CNN() (Refer to Figure 4.5)
                     | Learning Rate           | 0.01-0.001
                     | Optimizer               | Adam
                     | Local Batch Size        | 100
                     | Local Epochs            | 1
                     | Loss Function           | CrossEntropyLoss()
                     | Learning Rate           | 0.01

TABLE 4.3: Experimental Configurations for PPSS-enabled IDS.


4.4.1 Class-Specific Performance Across Different Scenarios


The examination of per-class performance using various models within the PPSS framework reveals several key findings (Table 4.4).

Both CFL and PPSS-enabled decentralized FL exhibit exceptional precision and detection rates when identifying normal network traffic, underscoring their effectiveness in benign traffic detection.

In the context of attack identification, Non-IID and IID data training yield similar precision and detection rates for both CFL and PPSS, demonstrating the efficient transfer learning enabled by federated aggregation in Non-IID data scenarios. However, Differential Privacy training via DP-SGD exhibits a trade-off between privacy preservation and model performance, negatively impacting attack detection, especially for less-represented classes. PPSS outperforms CFL in attack identification owing to an additional aggregation phase that uses exclusively qualified models stored in the Learning-Chain, which reduces training iterations. Nevertheless, certain attack classes, such as Fingerprinting and Ransomware, are prone to misclassification, emphasizing the limitations of transfer learning in cases of data insufficiency, notably within the FL framework.

These results underscore the PPSS framework's efficacy in normal traffic detection and attack identification while shedding light on the nuanced influence of differential privacy and the constraints of transfer learning in specific scenarios.

                   |         | IID                                  | Non-IID                              |
                   |         | Precision %    | Detection rate %    | Precision %    | Detection rate %    |
Class              | Setting | CFL   | PPSS   | CFL   | PPSS        | CFL   | PPSS   | CFL   | PPSS        | Support
Normal             | No-DP   | 100   | 100    | 100   | 100         | 100   | 100    | 100   | 100         | ?
                   | DP      | 100   | 100    | 100   | 100         | 100   | 100    | 100   | 100         |
Backdoor           | No-DP   | 72    | 74     | ?     | 89          | 68    | 76     | 95    | 88          | ?
                   | DP      | 72    | 72     | 89    | 89          | 40    | 100    | 92    | 00          |
Vulnerability_scan | No-DP   | 93    | 94     | 85    | 85          | ?     | 04     | 85    | 85          | 10022
                   | DP      | 93    | 93     | 85    | 77          | 93    | 93     | 84    | 84          |
DDoS_ICMP          | No-DP   | 100   | 100    | 100   | 100         | 100   | 100    | 97    | 100         | ?
                   | DP      | 100   | 94     | 97    | 100         | 100   | 90     | 69    | 100         |
Password           | No-DP   | 43    | 100    | 83    | 07          | 100   | 10     | 07    | 07          | ?
                   | DP      | 14    | 36     | 01    | 100         | 00    | 00     | 00    | 00          |
Port_Scanning      | No-DP   | 65    | 65     | 09    | 09          | ?     | 65     | 92    | 09          | ?
                   | DP      | 00    | 00     | 00    | 00          | 32    | 00     | 10    | 00          |
DDoS_UDP           | No-DP   | 98    | 98     | 100   | 100         | 98    | 98     | 100   | 100         | ?
                   | DP      | 93    | 100    | 100   | 99          | 58    | 98     | 100   | 100         |
Uploading          | No-DP   | 59    | 57     | 39    | 37          | 31    | 100    | 83    | 15          | ?
                   | DP      | 00    | 100    | 00    | 00          | 00    | 00     | 00    | 00          |
DDoS_HTTP          | No-DP   | 71    | 70     | 99    | 99          | 71    | ?      | ?     | 94          | ?
                   | DP      | 70    | 67     | 99    | 99          | 70    | 70     | 98    | 99          |
SQL_injection      | No-DP   | 54    | 41     | 17    | 90          | 42    | 39     | 28    | 100         | ?
                   | DP      | 37    | 00     | 98    | 00          | 37    | 37     | 100   | 100         |
Ransomware         | No-DP   | 00    | 00     | 00    | 00          | 00    | ?      | 00    | 14          | ?
                   | DP      | 00    | 00     | 00    | 00          | 00    | 00     | 00    | 00          |
DDoS_TCP           | No-DP   | 69    | 68     | 100   | 100         | 00    | 69     | 00    | 100         | ?
                   | DP      | 37    | 68     | 100   | 100         | 00    | 53     | 00    | 100         |
XSS                | No-DP   | 92    | 100    | 05    | 02          | 65    | 52     | 10    | 28          | ?
                   | DP      | 00    | 100    | 00    | 00          | 33    | 100    | 100   | 00          |
MITM               | No-DP   | 100   | 100    | 100   | 93          | 100   | 100    | 100   | 93          | 80
                   | DP      | 00    | 00     | 00    | 00          | 00    | 00     | 00    | 00          |
Fingerprinting     | No-DP   | 00    | 00     | 00    | 00          | 13    | 00     | 57    | 00          | ?
                   | DP      | 00    | 00     | 00    | 00          | 00    | 00     | 00    | 00          |

CFL : Centralized FL IDS; PPSS : PPSS-enabled decentralized FL IDS;
No-DP : No differentially private training; DP : with differentially private training;
Support : number of test samples; IID, Non-IID : data distribution; ? : illegible entry

TABLE 4.4: Per-class performance using different models.


4.4.2 Evaluation Results of PPSS under IID Data Distribution


Table 4.5 presents a comparative analysis of global model accuracies within the PPSS framework across various configurations, including client distribution and differential privacy settings. Test accuracy was used to assess each Prover's performance.

Under IID data distribution with three Provers, the best Prover accuracy reaches 93.71%. Expanding the Prover count to six, the best Prover accuracy attains 93.83% in the IID mode, and the global model accuracy reaches 93.98% with K = 20. In differential privacy settings, the best Prover accuracy is 93.46% with K = 40, while the global model accuracy achieves 93.72% at K = 80.

Finally, with eight Provers contributing to global model updates, the best Prover accuracy reaches 93.86% at K = 20, and the global model accuracy attains 94.01% with K = 40 in the IID scenario. However, under differential privacy constraints, the best Prover accuracy stands at 93%, with the worst at 90.92%.

These findings offer valuable insights into the performance of the PPSS framework across various settings, highlighting the influence of Prover count, differential privacy, and data distribution on model accuracy.


4.4.3 Evaluation Results of PPSS under Non-IID Data Distribution


Similarly, Table 4.6 compares global model accuracies within the proposed PPSS framework under Non-IID data distribution. With three Provers, the best Prover accuracy attains 92.45% at K = 80, while the worst Prover accuracy registers at 81.36% with K = 40. The global model achieves an accuracy of 92.52% at K = 80 without differential privacy constraints.

Expanding the Prover count to six, the best Prover accuracy achieves 93.27% in the Non-IID mode, with the worst accuracy declining to 83.22% at K = 20. The global model exhibits an accuracy of 93.75% with K = 40. Under differential privacy, the best Prover accuracy is 92.11%.


Chapter 4. PPSS: Privacy-Preserving Secure System for Industrial IoTs 


[Figure 4.6 panels: "CFL" and "With DP-CFL" (curves for 20, 40, and 80 clients); "PPSS with 40 clients" and "DP-PPSS with 40 clients" (curves for 3, 6, and 8 Provers); horizontal axis: Time (sec).]

FIGURE 4.6: Temporal Evolution of Global Model Accuracy with Varying Numbers of Provers and Clients in DP-CFL and DP-PPSS.


With eight Provers contributing to global model updates, the best Prover accuracy attains 93.40% at K = 80, while the global model's accuracy reaches 93.74% at K = 20 within the Non-IID scenario. In differential privacy settings, the best Prover accuracy is 91.71%, with the worst at 80.84%. The global model's performance in this scenario reaches an accuracy of 92.63% at K = 80.

Overall, the results indicate that the number of participating clients had a minimal impact on the accuracy of the global model. Moreover, they suggest that incorporating a certain number of Provers can alleviate the adverse effects of differential privacy on model accuracy.

Provers | Clients | 1st round                                 | 15th round
        |         | Without DP          | With DP             | Without DP          | With DP
        |         | B      W      G     | B      W      G     | B      W      G     | B      W      G
P=3     | K=20    | 90.12  86.43  90.12 | 75.44  73.21  75.44 | 93.44  92.80  93.99 | 92.65  91.71  93.40
P=3     | K=40    | 90.03  89.71  90.03 | 73.21  73.21  73.21 | 93.71  93.35  93.85 | 91.66  90.97  92.26
P=3     | K=80    | 87.89  85.49  87.89 | 73.21  73.21  73.21 | 93.70  93.67  93.90 | 92.38  90.31  92.09
P=6     | K=20    | 92.81  78.92  92.81 | 78.57  36.75  78.57 | 93.83  93.55  93.98 | 89.30  82.76  91.47
P=6     | K=40    | 91.02  83.12  91.02 | 73.21  73.21  73.21 | 93.79  93.42  93.97 | 93.46  91.30  93.46
P=6     | K=80    | 90.94  88.13  90.94 | 73.21  73.21  73.21 | 93.80  93.67  93.97 | 93.14  92.01  93.72
P=8     | K=20    | 93.21  55.39  93.21 | 86.98  73.21  86.98 | 93.86  75.26  93.92 | 90.92  82.46  92.23
P=8     | K=40    | 91.30  75.47  91.30 | 75.18  73.21  75.18 | 93.84  93.20  94.01 | 93.00  88.33  93.36
P=8     | K=80    | 93.25  81.93  93.25 | 73.21  73.21  73.21 | 93.84  93.33  93.90 | 91.98  90.04  92.90

(W): Worst prover ; (G): Global model ; (B): Best prover;

TABLE 4.5: Accuracy results of PPSS under IID Data Distribution.


Provers | Clients | 1st round                                 | 15th round
        |         | No-DP               | DP                  | No-DP               | DP
        |         | B      W      G     | B      W      G     | B      W      G     | B      W      G
P=3     | K=20    | 87.46  78.56  87.46 | 73.21  73.21  73.21 | 92.41  81.39  92.44 | 89.55  80.00  91.74
P=3     | K=40    | 87.78  79.52  87.78 | 73.21  73.21  73.21 | 92.32  81.36  92.48 | 89.15  80.06  89.80
P=3     | K=80    | 88.87  80.91  88.87 | 73.21  73.21  73.21 | 92.45  81.42  92.52 | 87.89  80.06  88.75
P=6     | K=20    | 90.69  73.86  90.69 | 84.60  71.29  84.60 | 90.79  83.22  93.28 | 89.59  78.52  91.83
P=6     | K=40    | 91.32  78.94  91.32 | 78.20  73.21  78.20 | 93.27  81.50  93.75 | 90.18  80.06  92.27
P=6     | K=80    | 86.64  80.07  86.64 | 80.33  73.21  80.33 | 93.19  81.36  93.55 | 92.11  80.29  93.18
P=8     | K=20    | 92.61  73.65  92.61 | 73.21  73.21  73.21 | 91.92  73.90  93.74 | 90.68  80.16  92.45
P=8     | K=40    | 91.06  81.25  91.06 | 75.03  73.21  75.03 | 93.35  81.51  93.69 | 90.94  80.05  92.38
P=8     | K=80    | 92.58  78.46  92.58 | 73.41  73.21  73.41 | 93.40  81.10  93.69 | 91.71  80.84  92.63

(W): Worst prover ; (G): Global model ; (B): Best prover;

TABLE 4.6: Accuracy results of PPSS under Non-IID Data Distribution.


The findings of this study underscore the potency of the proposed PPSS framework as a viable and privacy-preserving solution for Intrusion Detection Systems in the realm of Industry 5.0. By skillfully merging blockchain and federated deep learning technologies, PPSS contributes to the fortification of cyber security in the context of Industrial IoT, laying the groundwork for enhanced protection against cyber threats while maintaining the integrity of sensitive data and critical industrial operations.


4.4.4 Global Model Accuracy and Convergence Time


Figure 4.6 visually represents the global model's accuracy evolution over time, considering varying numbers of Provers and clients. Notably, the PPSS approach demonstrates superior global model convergence time, particularly when employing DP-SGD for training. This advantage stems from PPSS's use of fewer training iterations and more intensive model aggregation operations than CFL. For instance, where CFL deploys N clients, PPSS allocates N/P clients per Prover (where P is the number of Provers) for training, ensuring knowledge transfer for all N clients through aggregation. In our experiments, the computational overhead associated with model aggregation is significantly lower than that of individual client training. Additionally, DP training within PPSS can be improved by introducing additional Provers, thereby mitigating the adverse effects of Differential Privacy and enhancing overall training efficiency.

FIGURE 4.7: Comparative Analysis of Global Model Performance under High Privacy Regimes Employing DP-SGD.

FIGURE 4.8: Comparative Analysis of Global Model Training and Privacy Loss Across Varied Noise Levels.


4.4.5 The Impact of Differential Privacy Training via DP-SGD on Global Model Accuracy


Training using Differential Privacy Stochastic Gradient Descent (DP-SGD) to protect data privacy aligns with the principles of the strong composition theorem. This theorem asserts that the degree of privacy breach, quantified by standard (ε, δ)-differential privacy, tends to grow at an approximate rate of √K under conditions of stringent privacy requirements. Here, K represents the number of training iterations in the learning process. Figures 4.7 and 4.8 provide insight into the influence of the differential privacy parameters (ε, δ) on global model performance across the IID and Non-IID data distribution modes. We conducted experiments using varying epsilon values (ε = 0.1, ε = 1, ε = 2), which determine the introduced noise level, while maintaining a fixed δ value of 1.5e-5 for both CFL and PPSS training methodologies. This was done to illustrate the trade-off between data privacy and model performance and to establish the practicality of applying DP in Non-IID settings. Notably, we fixed δ because, as per [109], both ε and δ have similar effects on the introduced noise, with ε having the more pronounced impact on training performance.

The results indicate a notable decrease in CFL's performance under the influence of DP. For instance, with ε = 1, CFL's accuracy decreased from 93.98% to 92.69%, and further to 91.14% for ε = 0.1. PPSS also experienced a reduction in accuracy, decreasing from 94.01% to 92.38% for ε = 1, and to 92.18% for ε = 0.1. This decline in accuracy is a well-recognized consequence of the introduced noise and underscores the inherent trade-off between model performance and data privacy preservation. However, our proposed PPSS exhibits greater privacy preservation than CFL, leveraging transfer learning and reducing the number of training iterations. Furthermore, our experiments demonstrate the practicality of employing PPSS in Non-IID settings, showcasing its efficacy in preserving data privacy across varying scenarios.
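The √K growth mentioned above can be made concrete with the standard advanced composition bound: K uses of an (ε, δ)-DP mechanism satisfy (ε', Kδ + δ')-DP with ε' = ε√(2K ln(1/δ')) + Kε(e^ε − 1). The sketch below evaluates this bound; it is an illustration only, since libraries such as Opacus rely on tighter privacy accountants, and the slack parameter delta_slack is chosen here arbitrarily.

```python
import math

def advanced_composition(epsilon, delta, k, delta_slack):
    """Total (epsilon', delta') after k adaptive uses of an (epsilon, delta)-DP mechanism."""
    eps_total = (
        epsilon * math.sqrt(2 * k * math.log(1 / delta_slack))  # grows like sqrt(k)
        + k * epsilon * (math.exp(epsilon) - 1)                 # small when epsilon is small
    )
    delta_total = k * delta + delta_slack
    return eps_total, delta_total
```

For small per-step ε the first term dominates, so reducing the number of training iterations K, as PPSS does, lowers the accumulated privacy loss roughly in proportion to √K.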


4.4.6 PPSS Energy Cost


[Figure 4.9 panels: "When training without DP-SGD" and "When training with DP-SGD"; bars for 20, 40, and 80 clients; legend: CFL, PPSS, Consensus with PoFDL.]

FIGURE 4.9: Average Energy Consumption of PPSS-Enabled Decentralized Federated Learning (DFL) on Tesla-T4 GPU Devices.


Figure 4.9 illustrates the average energy consumption of the proposed PPSS-enabled DFL framework over a single round of Federated Learning (FL). Both CFL and DFL, with and without the PoFDL consensus mechanism, are evaluated, with and without DP training (DP-SGD).

In the absence of DP-SGD, CFL demonstrates higher energy efficiency than PPSS due to the additional model evaluation operations in PPSS. However, when DP-SGD is introduced, both CFL and PPSS experience substantial increases in energy consumption. CFL's energy costs surge nearly tenfold, while PPSS's increase around threefold. Consequently, PPSS proves more energy-efficient, particularly in multi-round FL scenarios with DP training, thanks to cost-effective model aggregation, which reduces the number of training rounds compared to CFL. Notably, the energy costs of model aggregation and PoFDL consensus remain significantly lower than those of training.

Overall, the results indicate that PPSS excels in energy efficiency, especially in multi-round FL with DP training, due to its efficient model aggregation and reduced training rounds compared to CFL.


4.4.7 Blockchain performance and storage overhead 


Metric          | 3 Provers | 6 Provers | 8 Provers
Nb_Blocks       | 6         | 9         | 5
Nb_transactions | 195       | 226       | 159
Storage (MB)    | 43.48     | 50.84     | 35.48

TABLE 4.7: The Average Data Generation Rate and Storage Overhead of the Learning-Chain.

We comprehensively evaluated our blockchain scheme, focusing on throughput, latency, and storage overhead. Latency was quantified as the time between transaction tx submission and block confirmation. Throughput, which measures the rate at which transactions tx are confirmed, was assessed in terms of computational load and message exchanges during block confirmation. Computationally, our proposed PoFDL consensus algorithm was employed for validation, utilizing model inference with inference costs dependent on model size and the specified ValidationSet. In terms of message exchanges, our PoFDL orchestrates 2(N + 1) message rounds, where N represents the number of Validator nodes.

Furthermore, based on empirical data from our experiments, we examined the storage overhead of the resulting Learning-Chain. Table 4.7 presents the average rate of block data generation across various Prover nodes. We calculated the storage capacity of each block based on the data it contained, as illustrated in Figure 4.3, without factoring in the storage overhead from validation, which is contingent on the size of the employed ValidationSet. For context, the size of the trained model used in our experiments was approximately 0.21 MB, while the average block size stood at 6.46 MB. Although the maximum storage overhead of the Learning-Chain for the presented learning task amounted to 50.84 MB, we anticipate that this figure may escalate with additional tasks.

Nonetheless, these findings hold relevance for real-world permissioned blockchain applications operating within the fog computing layer.


4.5 Chapter Summary 


In this chapter, we proposed a novel security framework, PPSS, designed to fortify Industry 4.0/5.0 against privacy breaches and emerging cyber threats. PPSS encompasses two core components: a blockchain-enabled FL system and a privacy-preserving cyber threat detection mechanism. Within the blockchain networked model, PPSS facilitates cross-silo FL through the involvement of specific roles: Validator nodes (V) serve as trusted blockchain maintainers, Prover nodes (P) moderate localized FL processes and provide efficient and precise models added to the Learning-Chain, and Edge-clients, which are connected to multiple P nodes, engage in differential privacy-enhanced model training. The cyber threat detection mechanism, in turn, capitalizes on PPSS's secure features to enhance the effectiveness, reliability, and efficiency of the Intrusion Detection System (IDS) in industrial IoT networks.

We comprehensively evaluated PPSS's reliability and efficiency under various scenarios and experimental settings. The findings demonstrate that our proposed framework fortifies the security and integrity of the model-sharing process, rendering it resilient against multiple attack scenarios. Moreover, the results confirm that the PPSS framework exhibits adept classification skills across a wide range of attacks, considering the unique challenges posed by industrial IoT and the influence of security constraints on the FL learning process.


CHAPTER 5


FEDGEN-ID: FEDERATED DEEP GENERATIVE MODEL
FOR INTRUSION DETECTION


"The five most efficient cyber defenders are Anticipation, Education, 


Detection, Reaction, and Resilience.” 
— Stephane Nappo 


5.1 Introduction 


Drawing upon the insights acquired from the preceding chapter, blockchain technology offers a robust framework for facilitating secure federated learning (FL) processes and enhancing the validation approach through proof of learning. However, the challenges of differential privacy training and non-IID data in cyber threat detection limit the effectiveness of FL-based models. Striking a balance between preserving privacy and maintaining model accuracy is a delicate task, especially when dealing with heterogeneous data sources with distinct threat profiles. Addressing these challenges requires innovative techniques and strategies to adapt FL to the unique characteristics of the cybersecurity domain.

Furthermore, recent studies have shown that ML and DL models are vulnerable to zero-day attacks, which exploit unknown vulnerabilities in software or hardware. These attacks create unique behaviors and attack patterns, posing a challenge in detection and identification, especially in situations with limited training data [110].

In addition, federated ML and DL models exhibit a distinct vulnerability to ad- 
versarial attacks. These attacks, which compromise model integrity and privacy, can 
exploit vulnerabilities during the training and inference stages. They are primarily at- 
tributed to the inaccessibility of data, which further compounds the challenges faced 
in securing these models [111]. During training, adversaries employ poisoning at- 
tacks to manipulate the model’s learning process and compromise performance. Dur- 
ing inference, adversaries employ evasion attacks to deceive trained models, leading 
to incorrect cyber threat detection. 

In the preceding chapter (4.2), our proposed Privacy-Preserving Secure System 
(PPSS) addressed vulnerabilities within the training stages. It offered secure aggre- 
gation and authentication schemes to ensure the reliability of the aggregated model. 
This chapter focuses on the inference stage and aims to develop a highly efficient fed- 
erated cyber threat detection framework that identifies zero-day cyber attacks while 
preserving data privacy and enhancing adversarial robustness against evasion at- 
tacks. 

In this chapter, we introduce an innovative security framework named "Federated
Generative Intrusion Detection" or "FedGen-ID," which addresses challenges related to
privacy-preserving training and non-IID data distribution, contributing significantly
to the robustness of cyber threat detection. Specifically, FedGen-ID is designed to
enhance the security of Industrial IoT networks, which are known for their complexity
and require specialized intrusion detection solutions. This framework employs FL and
generative AI capabilities to address privacy concerns during model training by
facilitating collaborative model development without sharing sensitive raw data.
Additionally, FedGen-ID recognizes the challenges posed by non-IID data distribution,
where individual devices may have unique threat patterns. By utilizing generative
techniques, the framework adapts to the specific data characteristics of each device,
promoting a more consistent and effective threat detection process across the entire
network. Moreover, in response to the evolving threat landscape, FedGen-ID contributes
to cyber threat detection resilience by generating synthetic data that covers a
broader range of potential attack scenarios, thereby improving its ability to detect
previously unseen zero-day attacks, a significant concern in cybersecurity.

The remaining sections of this chapter are organized as follows: Section 5.2 discusses
the design objectives of FedGen-ID, outlining its development goals and purposes.
Section 5.3 delves into framework development, covering training procedures, learning
objectives, and the quality of generated IDS data. Section 5.4 presents results and
discussions, including the evaluation of augmented IDS data, the effectiveness of
FedGen-ID in detecting adversarial attacks, and its overall performance in cyber
threat detection. Finally, Section 5.5 summarizes the key takeaways and findings of
FedGen-ID and its practical application in intrusion detection.


5.2 Design Objectives of FedGen-ID 


Recently, Generative Adversarial Networks (GANs) have emerged as a promising approach
for enhancing the robustness of optimization techniques in DL-based IDS. This
advancement enables IDS to effectively detect and counter adversarial attacks without
making predetermined assumptions about the capabilities of potential adversaries.
Moreover, GANs can serve as a valuable tool for data augmentation, particularly in
addressing the challenges associated with imbalanced and private datasets. However,
the practicality and effectiveness of deploying federated GANs for threat detection
and bolstering resilience against adversarial attacks are still in their early stages.
Additionally, assessing the generated data’s consistency, reliability, and
suitability, particularly in handling IDS source data, necessitates further
exploration.

FIGURE 5.1: FedGen-ID Design Scheme: The Proposed Federated Conditional Wasserstein
Generative Adversarial Network.


To address these challenges, we introduce an innovative security framework named 
"FedGen-ID" to enhance the efficiency of IDS, ensuring privacy protection and forti- 
fying resilience against adversarial attacks. Simultaneously, FedGen-ID seeks to opti- 
mize the sharing of security knowledge among participating entities, contributing to 
a more robust and secure cyber landscape. 

Figure 5.1 illustrates the workflow of our framework. Specifically, we employ FL to
address privacy concerns and the computational efficiency of industrial IoT, allowing
models to train on distributed data locally on user devices while only exchanging
model updates. In addition, we propose a generative framework to overcome the
challenges of limited, imbalanced, and non-IID data and to enhance adversarial
resilience, allowing robust and efficient cyber threat detection. We have designed a
three-model system that includes a federated generative model (i.e., cGAN-Generator),
a Discriminator model (i.e., cGAN-Critic), and a Classifier model. The federated
generative model (G) creates a variety of artificial samples, the Discriminator (D) is
trained to differentiate between artificially generated and real samples, and the
Classifier (C) is trained on both original and artificially generated data to
efficiently and robustly identify cyber threats.

In the federation process of FedGen-ID, we suggest that both the cGAN Genera- 
tor and Classifier models be shared among clients, while the cGAN Discriminator is 
kept on the client side. This setup is driven by the need to improve the stability and 
privacy protection of distributed GAN training, which is also vulnerable to adver- 
sarial attacks. By using the cGAN Discriminator locally, clients can identify and flag 
potential adversarial attacks for further investigation. This also enhances the commu- 
nication efficiency and privacy aspects of federated learning. By sharing the Genera- 
tor, clients can locally produce a variety of artificial samples and enhance their local 
datasets, aiding in the detection of zero-day and sophisticated adversarial attacks. 

In addition, the global Classifier, which is shared among clients, receives updates
that are also influenced by the synthetic samples generated using the global
Generator, instead of relying only on local updates contributed by individual
participants. This enables the classifier to be trained on extensive and diverse
datasets. As a result, the classifier can generalize well and excel in identifying a
variety of attacks based on their features, offering crucial insights for threat
analysis and response. Furthermore, this approach is designed to enhance the model’s
overall resilience and reduce the potential risks associated with learning patterns
induced by attackers from poisoned updates.


5.3 Framework Development and Experiments


5.3.1 Training Objectives and Algorithmic Insights 


Our training goal is to reach a balance where the generator creates a variety of
realistic samples. Concurrently, the critic accurately differentiates between real and
generated data, offering valuable feedback to the generator to produce samples that
meet the specified condition (that is, the target class label). Our implementation of
the Conditional-GAN leverages the power of deep convolutional neural networks (CNNs)
to efficiently extract significant features from the conditioning input samples. This
approach ensures that our model is well-equipped to handle a variety of scenarios and
challenges:


• The Discriminator model (D): As shown in Figure 5.2, this model consists of four
  convolutional layers, each with a rectified linear unit (ReLU) activation function.
  It accepts both generated and real data samples and calculates the estimated
  Wasserstein distance between the fake and real data distributions. This serves as a
  loss function for training objectives, offering enhanced feedback to the generator
  and guiding it to produce samples that closely mimic the real data distribution
  while aligning with the specified conditions on target classes. Additionally, D
  undergoes fine-tuning for predicting adversarial attacks in the post-GAN training
  phase. To facilitate this, we integrate a Dense layer that applies a binary
  cross-entropy loss with a Sigmoid function to its outputs. This quantifies the
  discrepancy between the predicted and actual values of real and generated data
  samples. By adopting this approach, we aim to bolster the Critic’s capacity to
  effectively discern and classify adversarial attacks.


• The Generator model (G): As shown in Figure 5.2, this model consists of four
  transposed convolutional layers, each with batch normalization and a ReLU activation
  function. G accepts random samples drawn from a uniform latent space, denoted as
  z ∈ ℝ^d, where d represents the dimension of the feature. It also takes a condition
  vector of class labels, denoted as y. The goal is to generate the necessary labeled
  examples. In accordance with the distribution of real data, the output of the
  generator is passed through a Sigmoid activation function. This maps the generated
  features into normalized values ranging between 0 and 1.


Symbol            Explanation
K                 The set of clients involved
I                 The number of local iterations
E                 The number of global epochs
m                 The size of the local batch
α_G               The learning rate for the Generator
α_D               The learning rate for the Critic
λ                 The penalty factor
G                 The Global Generator
D                 The Global Critic
P_z               The distribution of noise
D(·|·)            The function of the Critic
G(·|·)            The function of the Generator
P_r               The distribution of real data
x                 A sample of real data
z                 A vector of noise
y                 A randomly selected label
x̂                 An interpolated sample
∇_x̂ D(x̂|y)        The gradient of the output of the Critic with respect to x̂
L_gen             The loss of the Generator
L_disc            The loss of the Critic

TABLE 5.1: Notation for Algorithm Discussion.


• The Classifier Model (C): This is a standalone CNN model that is specifically
  engineered for tasks involving multi-class classification. By using augmented data
  during the training phase, C is able to effectively grasp the complex variations and
  intricacies inherent in real-world data. As a result, C exhibits a high degree of
  skill in identifying a diverse array of attack classes, thereby demonstrating its
  robustness and resilience, even in the face of adversarial attempts.


• Federated Learning Objective: The aim of Federated Learning (FL) is to update the
  global generator model G and the global Classifier C using K local models from the
  respective clients. To accomplish this, we implement an averaging algorithm, which
  can be expressed as follows:

  $$G = \frac{1}{K}\sum_{k=1}^{K} G_k, \qquad C = \frac{1}{K}\sum_{k=1}^{K} C_k \tag{5.1}$$

  Averaging allows for consolidating knowledge from multiple clients. This fosters
  collaborative learning in a distributed environment, which in turn boosts the
  performance of the model and its ability to generalize.


Algorithm 4 outlines the federated training procedure of FedGen-ID. This process
involves several clients concurrently training their local generators and critics.
Following this, they collectively update a global generator. The use of a convergence
threshold can potentially decrease the training time for each client, especially if a
certain level of convergence is reached early on.


Algorithm 4: FedGen-ID cGAN Training.

Input : Set of clients K, local iterations I, global epochs E, local batch size m,
        Critic’s learning rate α_D, Generator’s learning rate α_G, gradient
        penalty factor λ
Output: Trained Critic D and Generator G

Initialize Generator G with random weights
for r = 1 to R do
    Parallel: for each client k ∈ K do
        for t = 1 to E do
            Train Local Critic D_k on client k using Algorithm 6
            Train Local Generator G_k on client k using Algorithm 5
            if distance between fake and real predictions < 0.1 then
                break    // Convergence threshold
        end
    Update Global Generator G by averaging local generators:
        G ← (1/K) Σ_{k=1}^{K} G_k
return Trained Generator G to Clients
• Local Training Objective: The training goal for the client-side cGAN involves a
  process of alternating updates between the critic and generator networks. We
  incorporated the Wasserstein loss function into the objectives of both models [112],
  which serves as an approximation function that quantifies the similarity between
  the distributions of real and generated data, based on the amount of movement
  required to transform one distribution into the other. The aim is to stop the
  generator from collapsing into a single mode and to ensure that the samples it
  generates are realistic. The definition of the Wasserstein loss is as follows:

  $$\min_{G}\max_{D}\left(\mathbb{E}_{x \sim P_r}[D(x|y)] - \mathbb{E}_{z \sim P_z}[D(G(z|y))]\right) \tag{5.2}$$

  where P_z represents the noise distribution from which the generator produces
  synthetic data samples, and D(·|·) is the critic function, which evaluates and
  distinguishes between real data samples x drawn from the real data distribution P_r
  and the generated samples produced by the generator function G(·|·).

In simple terms, the critic’s goal is to differentiate between real data and fake
data, given the labels. Concurrently, the generator aims to deceive the critic by
generating data that is as realistic as possible, based on the target labels. To
enhance the stability of the cGAN, we incorporated the gradient penalty (GP) into the
previous loss equation. This serves as an approximation for enforcing 1-Lipschitz
continuity, encouraging the critic’s gradient norm to stay close to one. The
implementation of GP is as follows:

$$\lambda \cdot \frac{1}{n}\sum_{i=1}^{n}\left(\lVert \nabla_{\hat{x}_i} D(\hat{x}_i|y_i) \rVert_2 - 1\right)^2 \tag{5.3}$$

where λ is the hyper-parameter controlling the strength of the gradient penalty, x̂_i
is a sample randomly interpolated between real data x_i and generated data
G(z_i|y_i), and ∇_{x̂_i} D(x̂_i|y_i) represents the gradient of the critic’s output
with respect to x̂_i.


Furthermore, for better performance, the critic independently applies the binary
cross-entropy loss expressed as:

$$\min\left(-\frac{1}{n}\sum_{i=1}^{n}\left[\log(D(x_i|1)) + \log(1 - D(x_i|0))\right]\right) \tag{5.4}$$

where D(·|1) and D(·|0) represent D’s prediction for the input data sample as real or
fake, respectively, compared to the ground truth values (0, 1).

FIGURE 5.2: The Proposed Three-Model Approach for Efficient and Robust Cyber Threat
Detection.

On the other hand, for updating the classifier for multi-class classification, the
objective can be formulated as:

$$\min\left(-\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{i,c}\log(C(x_i))\right) \tag{5.5}$$

where C is the classifier, x_i is the augmented data sample, y_{i,c} is the ground
truth label for class c, and C(x_i) is the predicted probability distribution over
the classes.
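
The averaging step of Equation 5.1 can be illustrated with a minimal Python sketch. It assumes that each client model is exported as a dictionary mapping parameter names to flat lists of floats; this representation, the function name `federated_average`, and the toy values are illustrative assumptions, not the implementation used in the experiments.

```python
from typing import Dict, List

# Hypothetical flat representation of a model: parameter name -> list of values.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights]) -> Weights:
    """Unweighted parameter-wise mean over K client models (cf. Eq. 5.1)."""
    if not client_weights:
        raise ValueError("need at least one client model")
    k = len(client_weights)
    averaged: Weights = {}
    for name in client_weights[0]:
        # Group the j-th value of this parameter across all clients.
        columns = zip(*(weights[name] for weights in client_weights))
        averaged[name] = [sum(column) / k for column in columns]
    return averaged


# Two toy "generators" with one weight vector and one bias each.
clients = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
]
global_generator = federated_average(clients)
```

The same function would apply unchanged to the Classifier, since Equation 5.1 treats G and C identically.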
Algorithm 5: Training of FedGen-ID Generator.

Input : Number of local iterations I, size of local batch m, Generator’s
        learning rate α_G, penalty factor λ
Output: The trained Generator G

Download Generator G from the Server
for i = 1 to I do
    Draw m noise vectors {z_1, z_2, ..., z_m} from the noise distribution P_z
    Obtain m random labels {y_1, y_2, ..., y_m} from the clients
    Create synthetic samples:
        {G(z_1|y_1), G(z_2|y_2), ..., G(z_m|y_m)}
    Determine the generator loss using the Wasserstein loss:
        L_gen = -(1/m) Σ_{i=1}^{m} D(G(z_i|y_i))
    Adjust the weights of the Generator using gradient descent:
        G ← G - α_G · ∇_G L_gen
return The trained Generator G


Algorithm 6: Training of FedGen-ID Critic.

Input : Number of local iterations I, size of local batch m, Critic’s learning
        rate α_D, penalty factor λ
Output: The trained Critic D

Start by initializing the Critic D with weights chosen randomly
for i = 1 to I do
    Collect m real data samples {x_1, x_2, ..., x_m} from the clients
    Draw m noise vectors {z_1, z_2, ..., z_m} from a uniform distribution P_z
    Obtain m random labels {y_1, y_2, ..., y_m} from the clients
    Create synthetic samples: {G(z_1|y_1), G(z_2|y_2), ..., G(z_m|y_m)}
    Draw m random interpolation factors {α_1, α_2, ..., α_m} from a uniform
        distribution
    Calculate interpolated samples:
        {x̂_1, x̂_2, ..., x̂_m} = α_i x_i + (1 - α_i) G(z_i|y_i)
    Determine the critic loss using the Wasserstein loss with gradient penalty:
        L_disc = (1/m) Σ_{i=1}^{m} [D(x_i|y_i) - D(G(z_i|y_i)) + λ(‖∇_{x̂_i} D(x̂_i|y_i)‖_2 - 1)^2]
    Adjust the weights of the Critic using gradient descent:
        D ← D - α_D · ∇_D L_disc
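
To make the gradient penalty term of Equation 5.3 and the interpolation step of Algorithm 6 concrete, the following sketch evaluates the penalty for a toy critic whose input gradient is known in closed form. The critic D(x) = tanh(w · x), the omission of the label condition y, the value λ = 10 used in the example, and all function names are assumptions made for illustration; a real critic would obtain the gradient through automatic differentiation.

```python
import math
import random
from typing import List, Sequence


def interpolate(real: Sequence[float], fake: Sequence[float], alpha: float) -> List[float]:
    """x_hat = alpha * x + (1 - alpha) * G(z|y), as in Algorithm 6."""
    return [alpha * r + (1.0 - alpha) * f for r, f in zip(real, fake)]


def toy_critic_gradient(w: Sequence[float], x: Sequence[float]) -> List[float]:
    """Gradient w.r.t. x of the toy critic D(x) = tanh(w . x) (an assumption)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    scale = 1.0 - math.tanh(score) ** 2
    return [scale * wi for wi in w]


def gradient_penalty(w, real_batch, fake_batch, lam, rng) -> float:
    """lam * mean((||grad_x_hat D(x_hat)||_2 - 1)^2), cf. Eq. 5.3."""
    total = 0.0
    for real, fake in zip(real_batch, fake_batch):
        alpha = rng.random()  # interpolation factor drawn uniformly from [0, 1)
        x_hat = interpolate(real, fake, alpha)
        grad = toy_critic_gradient(w, x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)


rng = random.Random(0)
real = [[0.2, 0.9], [0.7, 0.1]]
fake = [[0.5, 0.5], [0.4, 0.6]]
penalty = gradient_penalty([1.5, -0.5], real, fake, lam=10.0, rng=rng)
```

The penalty vanishes only when the gradient norm at every interpolated point equals one, which is how the term pushes the critic toward 1-Lipschitz behaviour.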


5.3.2 FedGen-ID: Quality of Generated IDS Data 


The data produced by the conditional GAN necessitates further processing and
validation to comply with the constraints and traffic feature boundaries of the
original data.

Algorithm 7: Refinement of Generated Data.

Input : Original dataset O with n instances and d attributes, synthetic dataset
        S with the same dimensions as O, indices R of attributes that need
        correction for out-of-range values, indices B of binary attributes that
        need correction for incorrect values, indices C of one-hot encoded
        attributes that need correction for incorrect values
Output: Refined synthetic dataset S' with the same dimensions as S

Procedure RefineData(O, S, R, B, C)
    S' ← S                                   // Make a copy of synthetic data
    for i ∈ R do                             // Correct out-of-range values
        O_{min,i} ← min(O_{:,i})
        O_{max,i} ← max(O_{:,i})
        S'_{:,i} ← max(min(S_{:,i}, O_{max,i}), O_{min,i})
    for i ∈ B do                             // Correct binary values
        S'_{incorrect,i} ← (S'_{:,i} ≠ 0) ∧ (S'_{:,i} ≠ 1)   // Identify incorrect values
        S'_{corrected,i} ← ⌊S'_{incorrect,i}⌉                 // Round incorrect values to closest integer
        S'_{:,i} ← (S'_{corrected,i} ∧ S'_{incorrect,i}) ∨ (S'_{:,i} ∧ ¬S'_{incorrect,i})
                                             // Substitute incorrect values with corrected values
    for i ∈ C do                             // Correct one-hot encoded values
        h_i = argmax(S_{:,i})                // Determine index of maximum value
        S'_{:,i} ← e_{h_i}                   // Set all values except the maximum to 0
    return S'                                // Return refined synthetic data

Algorithm 7 is designed to verify the accuracy of generated data that might
contain errors or inconsistencies, especially in certain traffic feature categories. We 
take into account features that have out-of-range values, incorrect values for binary 
features, and incorrect values for one-hot encoded features. For features that are out- 
of-range, we identify instances where the synthetic data exceeds the valid range de- 
fined by the original data and adjust their values to fit within this range. For binary 
features, we correct these values by rounding them to the nearest integer. Lastly, for 
one-hot encoded features, the algorithm identifies the index of the highest value in 
the one-hot encoded feature vector and sets all other values to 0. This strategy
effectively guides researchers in addressing errors and inconsistencies in synthetic
data produced by GANs for network traffic data, facilitating the creation of more
consistent and reliable synthetic datasets for network-based cyber threat detection.
Figure 5.3 provides a visual representation of how artificial samples are refined
based on chosen features. For example, a feature such as ‘mqtt.conflags’ is expected
to only take on the values of 0 or 1. Similarly, a feature like ‘Http.Request’ should
fall into one of six categories. It is also important to note that features that are
within the range of actual examples are kept as they are.

FIGURE 5.3: Example of Data Refinement for Generated Network Traffic Data.
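
The three corrections of Algorithm 7 can be sketched in plain Python as follows. The sketch assumes that datasets are lists of rows, that each one-hot feature is given as a group of column indices, and that binary values are thresholded at 0.5 (which matches rounding inside [0, 1] and additionally clamps values outside it). The function name `refine_data`, its arguments, and the toy data are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def refine_data(
    original: Matrix,
    synthetic: Matrix,
    range_cols: Sequence[int],
    binary_cols: Sequence[int],
    onehot_groups: Sequence[Sequence[int]],
) -> Matrix:
    """Return a corrected copy of `synthetic`; the input is left untouched."""
    refined = [row[:] for row in synthetic]  # S' <- S

    # 1) Clip out-of-range values to the [min, max] observed in the original data.
    for j in range_cols:
        low = min(row[j] for row in original)
        high = max(row[j] for row in original)
        for row in refined:
            row[j] = max(min(row[j], high), low)

    # 2) Force binary attributes to 0 or 1 (assumed threshold: 0.5).
    for j in binary_cols:
        for row in refined:
            row[j] = 1.0 if row[j] >= 0.5 else 0.0

    # 3) Keep only the largest entry of each one-hot group and zero the rest.
    for group in onehot_groups:
        for row in refined:
            winner = max(group, key=lambda col: row[col])
            for col in group:
                row[col] = 1.0 if col == winner else 0.0

    return refined


original = [[0.0, 0.0, 1.0, 0.0], [10.0, 1.0, 0.0, 1.0]]
synthetic = [[12.5, 0.8, 0.7, 0.2], [-3.0, 0.1, 0.3, 0.4]]
refined = refine_data(original, synthetic, range_cols=[0], binary_cols=[1],
                      onehot_groups=[[2, 3]])
```

In this toy example the first column is clipped to the range [0, 10], the second is snapped to a valid binary value, and columns 2 and 3 are treated as a single one-hot feature.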


5.3.3 Experimental Settings 


We carried out the experiments of our proposed FedGen-ID security framework on Google
Colaboratory, utilizing PyTorch and Tesla T4 GPU accelerators. The participating
clients were provided with non-IID datasets, as depicted in Figure 5.5. We initially
established the federated cGAN training. Subsequently, we leveraged the federated
generative model to augment the training of the federated classifier model by
supplying it with augmented data. Following this, we implemented DP training for the
federated classifier model and evaluated the extent to which the augmented data could
alleviate the negative effects of DP. This approach ensures privacy while effectively
detecting and identifying zero-day cyber threats.

It is important to note that each client retains its own critic model, which serves
as a discriminator for identifying adversarial examples. The specifics of the
experimental settings and learning parameters used in this study are detailed in
Table 5.2. To assess the impact of security constraints, including distributed
learning and differential privacy training, on the learning process, we employed a
variety of metrics that measure both detection efficiency and effectiveness and give
insight into the performance and robustness of our proposed framework for detecting
zero-day cyber threats.

Parameter                                               Values
Federated cGAN         cGAN Generator                   Refer to 5.2
                       cGAN Critic                      Refer to 5.2
                       Local epochs                     10
                       Critic repeats for one epoch     2
                       Learning rate                    0.0002
                       Local Batch_size                 32
                       Global rounds                    5
Federated Classifier   Classifier                       CNN 15-class
                       Local Batch_size                 64
                       Global rounds                    15
                       Learning rate                    0.001
Differential privacy   Epsilon (ε)                      1
                       Delta (δ)                        1.5e-5
                       Gradient norm bound (C)          1.2
Optimizer                                               Adam


TABLE 5.2: Experimental settings for FedGen-ID. 
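
Table 5.2 lists the privacy budget (ε = 1, δ = 1.5e-5) and the gradient norm bound (C = 1.2) but not the mechanics of a private update. A common way to realise such training is DP-SGD-style clipping followed by Gaussian noise; the sketch below illustrates that idea under the assumption that per-sample gradients are available as flat lists. The `noise_multiplier` argument is a hypothetical free parameter here: in practice it would be derived from (ε, δ) by a privacy accountant, which this sketch does not attempt.

```python
import math
import random
from typing import List, Sequence


def clip_gradient(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale a per-sample gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(
    per_sample_grads: Sequence[Sequence[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each gradient, sum, add Gaussian noise, then average over the batch."""
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [value / len(per_sample_grads) for value in noisy]


rng = random.Random(0)
batch_grads = [[3.0, 4.0], [0.3, 0.4]]
# max_norm follows Table 5.2; noise_multiplier=1.0 is an arbitrary illustrative value.
update = private_mean_gradient(batch_grads, max_norm=1.2, noise_multiplier=1.0, rng=rng)
```

Clipping bounds the influence of any single record on the update, and the added noise is what degrades accuracy; this is the effect that the augmented data is intended to offset.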


5.3.3.1 Dataset Processing: This framework is also evaluated on the new Edge-IIoTset
[33], previously discussed in Section 4.3.3.1, which is both imbalanced and non-IID.
This dataset comprises fourteen labeled network attacks. The initial distribution of
the dataset after the holdout split is depicted in Table 5.3. To emulate data
heterogeneity, we partitioned the training set into non-IID partitions and
distributed them among ten clients. We employed a label partition method for this
purpose, ensuring that each client receives a random subset of labels with the same
feature vector as the training data. This approach is based on the assumption that
each client possesses partial knowledge of the total classes involved in the problem,
as shown in Figure 5.5.

FIGURE 5.4: Flowchart of FedGen-ID Framework Training, Aggregation, and Evaluation.

Classes                  Original Train Count    Original Test Count
Normal                   1046926                 323129
Backdoor                 19890                   4972
Vulnerability_scanner    40088                   10022
DDoS_ICMP                93149                   23287
Password                 40122                   10031
Port_Scanning            18051                   4513
DDoS_UDP                 88027                   22007
Uploading                30107                   7527
DDoS_HTTP                39929                   9982
SQL_injection            40962                   10241
Ransomware               8740                    2185
DDoS_TCP                 40050                   10012
XSS                      12732                   3183
MITM                     320                     80
Fingerprinting           801                     200

TABLE 5.3: Edge-IIoTset Data Distribution.
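
One way to realise such a label partition is sketched below. Each client is assigned a random subset of classes, and every training sample is then routed to one of the clients that owns its class. The number of classes per client, the rule of dropping samples whose class no client owns, and the function name `label_partition` are assumptions made for this illustration; the exact split used in the experiments is the one shown in Figure 5.5.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def label_partition(
    labels: Sequence[int],
    num_clients: int,
    labels_per_client: int,
    seed: int = 0,
) -> Tuple[List[List[int]], List[Set[int]]]:
    """Split sample indices so that each client only sees a subset of classes."""
    rng = random.Random(seed)
    classes = sorted(set(labels))

    # Each client draws its own random subset of class labels.
    client_classes = [set(rng.sample(classes, labels_per_client))
                      for _ in range(num_clients)]

    # Invert the assignment: class -> clients that hold it.
    holders: Dict[int, List[int]] = defaultdict(list)
    for client_id, owned in enumerate(client_classes):
        for cls in owned:
            holders[cls].append(client_id)

    # Route every sample to one randomly chosen holder of its class.
    partitions: List[List[int]] = [[] for _ in range(num_clients)]
    for index, label in enumerate(labels):
        if holders[label]:  # classes owned by no client are skipped (assumption)
            partitions[rng.choice(holders[label])].append(index)
    return partitions, client_classes


# Toy data: 300 samples spread evenly over 15 classes, split across ten clients.
labels = [i % 15 for i in range(300)]
partitions, client_classes = label_partition(labels, num_clients=10, labels_per_client=5)
```

Because every client holds only some of the 15 classes, the local label distributions differ from the global one, which is the non-IID property the experiments rely on.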


5.4 Results and Discussion 


5.4.1 Evaluating Convergence of Federated cGAN Training 


A comprehensive set of experiments was carried out to determine the optimal
hyperparameter configuration for the stability of the training process in our
proposed federated cGAN scheme. Our results indicate that stability improves when
multiple local epochs are used with fewer federated rounds. Figure 5.9 shows the
local training loss of the Federated cGAN, which uses the Wasserstein distance with
gradient penalty (Wass-GP), reported at specific training steps. The Critic loss,
which is directly related to the Wasserstein distance in both cGAN models, represents
an approximation of the negative of the Wasserstein distance. As shown, the Wass-GP,
unlike standard loss functions, is unbounded and can produce any value. This
characteristic lets the critic keep improving without encountering the vanishing
gradient issue.

FIGURE 5.5: Non-IID Data Distribution.

Interestingly, we observe that the critic’s loss begins at a relatively high value
and gradually diminishes over time, indicating an enhancement in the Critic’s
capability to differentiate between real and generated samples. On the other hand,
the generator loss starts at a lower value and slightly escalates over time. This can
be attributed to the improved performance of the Critic, which sets a more challenging
adversarial goal for the generator. Importantly, as the training advances, a pattern
of convergence becomes apparent, where the losses associated with both the generator
and Critic tend to converge towards each other.


5.4.2 Evaluating FedGen-ID for Adversarial Attack Detection


In order to reinforce the robustness and adaptability of our framework against the
constantly evolving adversarial attacks, we have enhanced the ability of local
critics to detect adversarial examples by adjusting their decision threshold through
the application of the Sigmoid function.

FIGURE 5.6: Loss history of fine-tuning client critics for adversarial attack
detection.


It’s worth noting that the Critic models were trained using the Wasserstein loss, 
which maximizes the distance between real and fake inputs. Therefore, if we ap- 
plied an activation function, we could predict adversarial examples and evaluate 
them against a corresponding ground truth value. However, our cGAN-Critic mod- 
els showed limitations in generalizing to other attack techniques, thereby reducing 
the practicality of the defense mechanism in real-world scenarios. To address this, we 
further fine-tuned the Critic models using data from the global generator and data 
from other advanced attack techniques to enhance adversarial variety. 

More specifically, we added a linear layer and trained it using authentic data from
the clients’ datasets, combined with data from the global generator and more
sophisticated attack methods, including FGSM adversarial attacks. Figure 5.6 depicts
the fine-tuning history of the clients’ critics over 15 epochs. The results are
reported in Table 5.4.
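
For readers unfamiliar with FGSM, the following sketch shows its single perturbation step on a toy logistic-regression detector, where the input gradient of the binary cross-entropy loss is available in closed form. The toy model, its weights, the step size `eps`, and the clipping to [0, 1] (on the assumption that features are scaled to that range) are illustrative choices; the attack configuration used to fine-tune the critics is not reproduced here.

```python
import math
from typing import List, Sequence


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def fgsm_logistic(
    x: Sequence[float], y: float, w: Sequence[float], b: float, eps: float
) -> List[float]:
    """One FGSM step: x_adv = clip(x + eps * sign(dLoss/dx), 0, 1)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # For binary cross-entropy on a logistic model, dLoss/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    signs = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, signs)]


w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.5], 1.0  # a sample whose true label is 1
x_adv = fgsm_logistic(x, y, w, b, eps=0.1)

score_before = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
score_after = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
```

Each feature moves by at most `eps`, yet the score for the true class drops; small perturbations of this kind are what the fine-tuned critics are meant to flag.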


5.4.3 Evaluating FedGen-ID for Data Augmentation 


Notably, our proposed federated generative model (FGM) approach incorporates
class-conditioned labels. Although conditioning does not guarantee label accuracy, it
significantly enhances data diversity. Our investigation produced a dataset
comprising 50,000 instances for each distinct attack class. However, following the
application of our data refinement methodology, which introduces marginal
modifications to feature values, a mismatch was detected between the initially
specified target classes and the labels predicted by the FedID classifier.

FIGURE 5.7: An Examination of Validation Accuracy in FedGen-ID Compared to Standalone
FedID with and without Differential Privacy Training on the Original Test Data.


Figure 5.8 shows the class distribution of the generated dataset. The results
demonstrate that the approach successfully captures the underlying patterns and
features of classes such as Normal, Password, Fingerprinting, XSS, and Port_Scanning,
as indicated by their relatively high sample counts.

It is worth mentioning that, although our approach uses conditioning during the
generation process to label the generated samples with specific class targets, we
observed numerous misalignments between the generated data and the intended ground
truth target class. This analysis was conducted using a pre-trained classifier. In
the scope of our research, we relabel the generated samples using the FedID
classifier, which achieves 96% accuracy on the original training data, to rectify the
labeling discrepancies. However, it is worth noting that techniques such as
self-supervised learning could be investigated in prospective studies.

FIGURE 5.8: Confusion Matrix Depicting the Class Distribution of Generated Traffic,
Labeled Using the FedID Classifier.

These results highlight the Wasserstein conditional GAN’s ability to generate
synthetic data that faithfully exhibits the distinct characteristics associated with
each class. However, it is worth noting that certain classes, including Backdoor,
HTTP, and DDoS_UDP, exhibit relatively low counts, suggesting the presence of fewer
distinctive patterns or features, which makes accurate generation more difficult.
Nevertheless, by integrating these generated samples into the local training process
of participating clients, we aim to enhance robustness and classification efficiency
against adversarial and zero-day cyber attacks.

Table 5.4 showcases the effectiveness of our proposed individual detector against
three different adversarial attacks. The table reveals that the performance of
individual critics varies against different adversarial attacks. Some clients have
shown high accuracy and detection rates while maintaining relatively low false
positive rates. For example, under the FGSM attack, the most effective client
(Client 8) achieved a detection rate of 97.29% and a false positive rate of only
4.57%. On the other hand, the least effective client (Client 2) had a false positive
rate of 15.76%. Client 9 performed best under both the BIM and DeepFool attacks,
demonstrating high detection rates and accuracy.

Attacks     Clients             Accuracy %    DR %     FPR %
FGSM        Worst client: 2     92.05         98.64    15.76
            Best client: 8      96.74         97.29    04.57
BIM         Worst client: 10    92.97         98.96    18.52
            Best client: 9      98.04         98.92    04.01
DeepFool    Worst client: 2     92.12         98.82    15.76
            Best client: 9      98.79         100      04.01

TABLE 5.4: Assessing the effectiveness of our proposed individual detector against
three different adversarial attacks.
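
The three metrics reported in Table 5.4 can be derived from confusion-matrix counts. The sketch below uses the standard definitions (detection rate as the true positive rate, false positive rate as FP / (FP + TN)); these definitions, the function name, and the counts in the example are assumptions for illustration and are not values taken from the experiments.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, detection rate (recall), and false positive rate in percent."""
    total = tp + fp + tn + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "detection_rate": 100.0 * tp / (tp + fn),
        "false_positive_rate": 100.0 * fp / (fp + tn),
    }


# Hypothetical counts: 100 adversarial inputs and 100 benign inputs.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Reading the two rates together matters: a critic can reach a high detection rate simply by flagging most inputs, which shows up as an elevated false positive rate, as for the worst clients in Table 5.4.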

These results highlight the capability of our proposed method of refining individual
critics to effectively identify complex adversarial attacks, rather than depending on
a single model for defense against all attacks. Our approach stands out from
traditional methods as it employs an additional classifier detection model to examine
adversarial inputs that went undetected, thereby improving the overall robustness of
the system. This approach underscores the importance of diversity and adaptability in
building resilient defense mechanisms against adversarial attacks.

To evaluate the computational efficiency of our proposed federated generative
framework (FedGen-ID), Figure 5.7 compares the training accuracy of FedGen-ID and
FedID over time. The comparison covers both scenarios, with and without Differential
Privacy (DP), and uses the Original Real-TestSet for the evaluation. The figure shows
that FedGen-ID achieves accuracy nearly equivalent to FedID without DP and that it
outperforms FedID when DP training is applied, which underscores the effectiveness
and efficiency of our proposed FedGen-ID framework.
FIGURE 5.9: Training Local cGAN: Loss versus Training Steps.


Additionally, our analysis reveals that incorporating DP incurs a training overhead
for both frameworks, with FedGen-ID showing the smaller increase in computational
time, which we attribute to its data augmentation approach. As a privacy-preservation
measure, FedGen-ID is therefore considerably more cost-effective than DP training,
and it remains effective when the two privacy-enhancing strategies are combined.
This further substantiates the robustness and efficiency of our proposed framework
and its practical applicability in privacy-sensitive settings.
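The training overhead of DP can be made concrete with a sketch of one differentially private gradient step. This assumes that DP training follows the common DP-SGD recipe (per-example gradient clipping followed by Gaussian noise), which the text does not confirm; the function name and all parameter values are hypothetical.

```python
import numpy as np


def dp_sgd_step(w, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style update (assumed recipe, not the thesis's exact procedure).

    per_example_grads has shape (batch, dim): one gradient per training sample.
    """
    # Clip each example's gradient to an L2 norm of at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Add Gaussian noise calibrated to the clipping bound, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return w - lr * noisy_mean


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # norms 5.0 and 0.5
w0 = np.zeros(2)
# With the noise switched off, only the clipping effect remains visible.
w1 = dp_sgd_step(w0, grads, lr=1.0, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Computing and clipping one gradient per sample, rather than one per batch, is the main source of extra cost in this recipe.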

These results underscore the potential of FedGen-ID as a valuable tool for privacy-
preserving FL in security-sensitive contexts. However, while FedGen-ID demonstrates
promising results in accuracy, resilience, and generalization, there are specific
classes where further refinement may improve precision and recall.


5.4.4 Evaluating FedGen-ID for Zero-day Attack Detection 


Similarly, we evaluate the detection accuracy and resilience of FedGen-ID against
zero-day threats. Our non-IID setup mimics the unpredictability of these threats by
omitting certain attack classes from the datasets of specific clients. Furthermore,
we simulate these attacks by augmenting the TestSet with variations and new instances
produced by the global generator. We maintained the integrity of the test by ensuring
that there were no duplicate records. The generated samples of zero-day attacks were
labeled with their corresponding known attack labels. Figure 5.10 shows the
performance results in detecting and identifying these zero-day attacks. Furthermore,
we evaluated
FIGURE 5.10: Classification Analysis: Visualizing Zero-Day attack detection.


the combined original TestSet and the generated samples of zero-day attacks, referred
to as the "Augmented TestSet". Figure 5.11 compares FedGen-ID and FedID, taking into
account the impact of DP training on the classification accuracy of both frameworks
across both test sets. The findings show that FedGen-ID maintains competitive accuracy
across the test sets. Specifically, FedGen-ID achieves an accuracy of 92.72% without
DP training and 92.47% with DP training on the original TestSet, a decrease of only
0.25 percentage points. Notably, it surpasses FedID on the Augmented TestSet by 14%
without DP training and by 10% with DP training. These findings support FedGen-ID as
a robust and adaptable privacy-preserving IDS capable of tackling the ever-evolving
challenges of cyber




[Figure 5.11 panels: "Original Real-TestSet" and "Augmented-TestSet". Each panel groups
bars by "Without DP-training" and "With DP-training"; legend: Centralized learning,
Using FedID, Using FedGen-ID.]


FIGURE 5.11: Comparative Analysis of Cyber Threat Detection Performance and
Robustness using our proposed Federated Generative Intrusion Detection (FedGen-ID)
and Standalone Federated Intrusion Detection (FedID).


threat detection in privacy-sensitive IoT environments.
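The evaluation setup in this section has two mechanical steps: withholding attack classes from some clients, and building a duplicate-free augmented test set. A minimal sketch of both follows. The function names, the random sharding, and the toy arrays are assumptions made for illustration, and the generated samples are stand-in values rather than output of the global generator.

```python
import numpy as np


def make_non_iid_clients(X, y, n_clients, held_out, rng):
    """Shard (X, y) across clients and drop the classes listed in held_out[c]
    from client c, so those classes are unseen for that client."""
    shards = np.array_split(rng.permutation(len(y)), n_clients)
    clients = []
    for c, shard in enumerate(shards):
        keep = ~np.isin(y[shard], held_out.get(c, []))
        clients.append((X[shard][keep], y[shard][keep]))
    return clients


def build_augmented_testset(X_test, y_test, X_gen, y_gen):
    """Append generated samples to the test set and remove duplicate records."""
    X_all = np.vstack([X_test, X_gen])
    y_all = np.concatenate([y_test, y_gen])
    rows = np.column_stack([X_all, y_all])
    _, first_idx = np.unique(rows, axis=0, return_index=True)
    keep = np.sort(first_idx)  # preserve the original ordering
    return X_all[keep], y_all[keep]


rng = np.random.default_rng(0)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 1, 2] * 4)
# Hypothetical choice: client 0 never sees class 2, client 1 never sees class 1.
clients = make_non_iid_clients(X, y, n_clients=3, held_out={0: [2], 1: [1]}, rng=rng)

X_test = np.array([[1.0, 2.0], [3.0, 4.0]])
y_test = np.array([0, 1])
X_gen = np.array([[3.0, 4.0], [5.0, 6.0]])  # first row duplicates a test record
y_gen = np.array([1, 1])
X_aug, y_aug = build_augmented_testset(X_test, y_test, X_gen, y_gen)
```

Deduplicating on features and label together keeps a generated sample that shares features with a real record but carries a different label; whether the thesis deduplicates on features alone is not stated.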


5.4.5 Overall Evaluation of FedGen-ID for Cyber Attack Detection 


Table 5.5 reports per-class results to evaluate how far FedGen-ID improves the
precision and recall of detecting and identifying various cyber threats, as well as
its robustness against zero-day attacks. Without DP, both FedID and FedGen-ID achieve
high precision and recall in detecting ‘Normal’ traffic. With DP, there is a slight
decrease in precision, while recall remains competitive across all experiments.

For specific attack categories, the precision and recall scores of FedGen-ID and
FedID differ significantly across privacy settings. For classes such as ‘MITM’,
‘DDoS_UDP’, ‘DDoS_ICMP’, and ‘Password’, FedGen-ID achieves performance levels that
are nearly equivalent to or better than FedID without DP. When both
privacy-enhancing strategies are combined, FedGen-ID exhibits strong performance,
particularly in scenarios involving zero-day attacks. This further emphasizes the
robustness and adaptability of our proposed framework.
[Table 5.5 layout. Column groups: Original TestSet and Augmented TestSet; under each,
Precision % and Detection rate %, each reported for FedID and FedGen-ID. Every class
has two settings rows, No-DP and DP.]

Classes: Normal, Backdoor, Vulnerability_scan, DDoS_ICMP, Password, Port_Scanning,
DDoS_UDP, Uploading, DDoS_HTTP, SQL_injection, Ransomware, DDoS_TCP, XSS, MITM,
Fingerprinting.

Fingerprinting | DP | 0.00 | 0.00 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00


FedID: Federated Intrusion detection; FedGen-ID : Federated Generative Intrusion detection; 
No-DP : No differentially private training; DP : with differentially private training. 


TABLE 5.5: Evaluating performance across individual classes using vari- 
ous assessment criteria. 


Figure 5.12 visually presents the confusion matrices for various settings, showcas- 
ing the performance of the FedGen-ID framework on the augmented-TestSet. These 
results offer valuable insights into the model’s ability to classify different types of 
attacks and normal traffic instances. 
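The per-class precision and detection rate of Table 5.5 can be read directly off a confusion matrix such as those in Figure 5.12. The sketch below shows the computation under the standard definitions; the 3-class matrix is a made-up example, and returning 0 for a class that is never predicted is an assumed convention.

```python
import numpy as np


def per_class_scores(cm):
    """Per-class precision and detection rate (recall) in percent.

    cm[i, j] is the number of samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    return 100.0 * precision, 100.0 * recall


# Made-up confusion matrix for three classes.
cm = [[8, 2, 0],
      [1, 9, 0],
      [1, 0, 4]]
precision, recall = per_class_scores(cm)
```

In this example the third class has perfect precision but a lower detection rate, because one of its five samples is absorbed by the first class.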

Overall, our proposed FedGen-ID framework presents a novel contribution to
federated generative intrusion detection. We demonstrated its effectiveness in
tackling the challenges associated with preserving privacy, defending against zero-day
and adversarial attacks, and confronting emerging cyber threats in industrial IoT
applications. This underscores the potential of FedGen-ID as a robust and efficient
tool
(B) FedGen-ID on original Test-Set
FIGURE 5.12: Classification Analysis: Visualizing Confusion Matrices.


for enhancing security in the rapidly evolving field of IoT. Its adaptability and
resilience make it particularly suited for real-world applications where privacy and
security are paramount.


5.5 Chapter Summary 


This chapter introduced a three-model paradigm (FedGen-ID) to enhance privacy
preservation and resilience against evolving cyber threats. The first model, a
federated generative model, employs the GAN approach for data augmentation. Only
generator model updates are exchanged among clients, and we introduce a novel loss
function to diversify generated samples, addressing challenges posed by imbalanced
and distributed data.
We also implement a data refinement method to align generated data with predefined
constraints. The second model refines local Critics to enhance resilience, while
the third model is a cyber threat classifier. We evaluate our FedGen-ID framework
using an industrial cybersecurity dataset, demonstrating its efficiency and robustness
in detection accuracy while maintaining data privacy. The results indicate that our
proposed data augmentation method supports a synthetically enhanced federated
learning scheme, improving detection efficiency and resilience against zero-day
attacks.
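The exchange of generator updates summarized above can be illustrated with a weighted parameter average. The sketch assumes FedAvg-style aggregation weighted by local dataset size; the aggregation rule actually used by FedGen-ID is not restated in this summary, and the parameter names and sizes are hypothetical.

```python
import numpy as np


def aggregate_generators(client_weights, client_sizes):
    """Weighted average of per-client generator parameters (FedAvg-style, assumed).

    client_weights: list of dicts mapping parameter name -> np.ndarray.
    client_sizes:   number of local training samples per client.
    Critic parameters are deliberately absent: they stay on the clients.
    """
    total = float(sum(client_sizes))
    return {
        name: sum(w[name] * (n / total) for w, n in zip(client_weights, client_sizes))
        for name in client_weights[0]
    }


# Two hypothetical clients holding 1 and 3 local samples, respectively.
g1 = {"dense/kernel": np.ones((2, 2)), "dense/bias": np.zeros(2)}
g2 = {"dense/kernel": 5.0 * np.ones((2, 2)), "dense/bias": np.ones(2)}
global_generator = aggregate_generators([g1, g2], client_sizes=[1, 3])
```

Because only the generator dictionaries leave each client, neither raw traffic records nor critic parameters are shared with the server.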
CHAPTER 6


CONCLUSION AND FUTURE WORK


This chapter summarizes the principal findings of the thesis and offers
recommendations for future research.


6.1 Conclusion 


In this thesis, a privacy-preserving security framework was proposed for cyber threat 
detection in the Industrial IoT infrastructure. This framework addresses the security 
requirements and challenges posed by Industrial IoT ecosystems and proposes new, 
effective, and robust detection strategies to secure Industry 5.0 from emerging cyber 
threats. 

First, we introduced a cost-effective and efficient federated learning methodology
for malware detection, with a primary focus on privacy preservation, computation
cost, and detection efficiency. The results demonstrated the efficiency and
effectiveness of this methodology using a CNN approach in comparison with
conventional centralized methods in terms of computation cost and privacy protection.
However, the detection efficiency proved inadequate when considering only
network-based statistical features. Furthermore, the inherent insecurity of the FL
process remains a critical concern: it requires a reliable framework for secure
aggregation and validation of uploaded updates, must cope with system unreliability,
and must safeguard privacy during the model uploading process.

In the second part of our study, we introduced a privacy-preserving secure
framework, PPSS, which integrates blockchain technology with the energy-efficient
Proof-of-Federated Deep Learning (PoFDL) consensus protocol. This framework is
designed to enhance the security and reliability of the FL process while promoting
transparency, especially in resource-constrained industrial systems. We evaluated the
effectiveness of PPSS using a recent dataset focused on industrial cybersecurity
(Edge-IIoT). Our evaluation covered detection rate, accuracy, computational
efficiency, and energy consumption. The results highlight that PPSS substantially
enhances the security and integrity of the model-sharing process, rendering it
resilient to vulnerabilities and potential exploit scenarios. Furthermore, PPSS
demonstrated strong identification capabilities against a variety of attacks while
effectively managing security constraints within the FL process.

Lastly, the third contribution extended the scope of FL by employing federated
generative adversarial networks, FedGen-ID, and data augmentation techniques to
develop a robust cyber threat detection framework for the Industrial IoT. FedGen-ID
employs two approaches: the FL-based GAN approach and the FDL approach. It uses
the GAN approach with a Wasserstein loss function to produce high-quality and
diversified IDS data, addressing challenges posed by imbalanced and distributed data.
FedGen-ID also refines local GAN Critics to enhance resilience against adversarial
attacks. In the second approach, FedGen-ID uses GAN-based augmented data to support
FDL. The results demonstrate that this data augmentation method supports a
synthetically enhanced federated learning scheme, improving detection efficiency and
resilience against zero-day attacks.




In summary, we effectively balanced the demands of emerging technologies with
the security and privacy concerns of IoT-enabled industrial infrastructure. These
collective efforts underscore the importance of innovative detection strategies for
countering large-scale malware attacks and ensuring the resilience of critical
industrial systems against evolving cyber threats. The findings offer valuable
insights to industry stakeholders, cybersecurity professionals, and researchers,
enabling them to maintain the stability and security of Industry 5.0 operations.


6.2 Future work 


6.2.1 Deployment of Privacy-Preserving Secure System 


Our future research agenda includes the implementation of the proposed
Privacy-Preserving secure framework (PPSS), broadening the scope of applicability and
robustness testing on tangible IoT devices such as Raspberry Pi and other open-source
platforms. Moreover, we aim to explore alternative privacy protection measures,
such as homomorphic encryption, alongside other unsupervised learning methodologies.
This diversified exploration promises to enrich our understanding of
privacy-preserving collaborative learning and strengthen the framework’s versatility
in accommodating various privacy paradigms.


6.2.2 Empowering Federated Learning with Generative-AI


Future studies will focus on improving our federated generative framework using
more promising approaches, such as ensemble learning for collective decision-making,
and self-supervised learning methodologies for enhancing generative model
capabilities.